@pbottine

This fixes issue #12 where certain malformed PDFs would cause "List index out of range" errors during parsing.

Changes to PDFList.load():

  • Handle empty lists by returning zero-length wrapper at offset 0
  • Filter items to only those with offset information before calculating bounds
  • Log warnings when items lack position data instead of using incorrect defaults
  • Change from @staticmethod to @classmethod for better conventions
  • Add comprehensive docstring explaining edge case behavior
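The `PDFList.load()` behavior listed above could look roughly like the sketch below. This is not the code from this PR: `Item` and `ListWrapper` are hypothetical stand-ins, and the choice to raise `ValueError` when no item has an offset is an assumption based on the `ValueError` handling mentioned under `parse_object()`.

```python
import logging
from dataclasses import dataclass
from typing import Any, List, Optional

log = logging.getLogger(__name__)


@dataclass
class Item:
    """Hypothetical parsed value carrying optional position data."""
    value: Any
    offset: Optional[int] = None  # byte offset in the file, if known
    length: int = 0               # byte length, if known


@dataclass
class ListWrapper:
    """Hypothetical stand-in for PDFList: a list plus its byte span."""
    items: List[Item]
    offset: int
    length: int

    @classmethod
    def load(cls, items: List[Item]) -> "ListWrapper":
        if not items:
            # Empty list: zero-length wrapper at offset 0.
            return cls(items=[], offset=0, length=0)
        # Only items that carry an offset take part in the bounds calculation.
        located = [i for i in items if i.offset is not None]
        if len(located) < len(items):
            log.warning("%d of %d list items lack position data",
                        len(items) - len(located), len(items))
        if not located:
            # Assumption: a list with no usable offsets counts as malformed.
            raise ValueError("no list item has offset information")
        start = min(i.offset for i in located)
        end = max(i.offset + i.length for i in located)
        return cls(items=list(items), offset=start, length=end - start)
```

Because `load` is a classmethod here, it builds its result through `cls(...)`, so a subclass calling `load` would get an instance of itself.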

Changes to parse_object() dictionary handling:

  • Fix logic ordering: check isinstance(value, list) before checking emptiness. This prevents skipping falsy but valid values like 0, False, or empty strings.
  • Gracefully skip empty lists in dictionaries (log debug message)
  • Catch ValueError from PDFList.load() for truly malformed lists
  • Log warnings instead of raising exceptions for unexpected values
  • Continue parsing to extract maximum data from malformed PDFs
  • Update dictionary to keep it self-consistent after wrapping lists
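A minimal sketch of the ordering fix, assuming a hypothetical helper `wrap_lists` and a caller-supplied `wrap` callable in place of `PDFList.load()`; the real `parse_object()` is not reproduced here.

```python
import logging
from typing import Any, Callable, Dict

log = logging.getLogger(__name__)


def wrap_lists(obj: Dict[str, Any], wrap: Callable[[list], Any]) -> Dict[str, Any]:
    """Wrap non-empty list values in place; leave every other value untouched."""
    for key, value in list(obj.items()):
        # Type check first: testing `not value` first would also skip
        # 0, False, and "", which are valid non-list values.
        if not isinstance(value, list):
            continue
        if not value:
            log.debug("skipping empty list at key %r", key)
            continue
        try:
            # Store the wrapper back so the dictionary stays self-consistent.
            obj[key] = wrap(value)
        except ValueError as e:
            # Malformed list: warn and keep parsing the remaining keys.
            log.warning("malformed list at key %r: %s", key, e)
    return obj
```

With this ordering, a dictionary such as `{"Count": 0, "Kids": []}` keeps `Count` untouched and only logs a debug message for `Kids`.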

Testing:

  • Added comprehensive unit tests in tests/test_pdf.py covering:
    • Empty lists
    • Lists with/without offset information
    • Mixed offset scenarios
    • Empty lists in dictionaries
    • Malformed list values
    • Unexpected dictionary values
    • Preservation of falsy but valid values (0, False)
  • All 8 new tests pass
  • All 23 existing unit tests continue to pass
  • PDF parsing verified with testdata/javascript.pdf

This builds on the approach from PR #3426 by @mrscottyrose, correcting the logic ordering and offset calculation and adding comprehensive test coverage.

Fixes #12

cc: @smoelius

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@pbottine pbottine requested a review from ESultanik as a code owner November 26, 2025 20:29
Development

Successfully merging this pull request may close these issues.

'List index out of range' on one of Ange's POC files
