Fix handling of empty lists and malformed PDF dictionary values #3438

pbottine · 2025-11-26T20:29:48Z

This fixes issue #12 where certain malformed PDFs would cause "List index out of range" errors during parsing.

Changes to PDFList.load():

- Handle empty lists by returning zero-length wrapper at offset 0
- Filter items to only those with offset information before calculating bounds
- Log warnings when items lack position data instead of using incorrect defaults
- Change from @staticmethod to @classmethod for better conventions
- Add comprehensive docstring explaining edge case behavior
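
As a rough illustration of the behavior listed above (not the actual diff), a minimal sketch might look like the following. `Item` and `SketchList` are hypothetical stand-ins for the project's parsed objects and `PDFList`; the `offset`/`length` attribute names, and raising `ValueError` when no item carries an offset, are assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional

log = logging.getLogger(__name__)


@dataclass
class Item:
    """Hypothetical parsed object with optional position data."""
    value: Any
    offset: Optional[int] = None
    length: int = 0


class SketchList(list):
    """Hypothetical stand-in for PDFList: a list that records its byte span."""

    def __init__(self, items, offset, length):
        super().__init__(items)
        self.offset = offset
        self.length = length

    @classmethod
    def load(cls, items):
        # Empty input: return a zero-length wrapper at offset 0.
        if not items:
            return cls([], offset=0, length=0)
        # Only items that carry an offset take part in the bounds calculation.
        positioned = [i for i in items if getattr(i, "offset", None) is not None]
        if len(positioned) < len(items):
            log.warning("%d list item(s) lack position data",
                        len(items) - len(positioned))
        if not positioned:
            # Assumption: with no usable offsets the list is treated as malformed.
            raise ValueError("no list item carries offset information")
        start = min(i.offset for i in positioned)
        end = max(i.offset + i.length for i in positioned)
        return cls(items, offset=start, length=end - start)
```

Using a classmethod here means a subclass calling `load()` gets an instance of itself back, which is one common reason to prefer it over a staticmethod for alternate constructors.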

Changes to parse_object() dictionary handling:

- Fix logic ordering: check isinstance(value, list) before checking emptiness. This prevents skipping falsy but valid values like 0, False, or empty strings
- Gracefully skip empty lists in dictionaries (log debug message)
- Catch ValueError from PDFList.load() for truly malformed lists
- Log warnings instead of raising exceptions for unexpected values
- Continue parsing to extract maximum data from malformed PDFs
- Update dictionary to keep it self-consistent after wrapping lists
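
To make the ordering point concrete, here is a small sketch of the pattern described above, not the code from this PR. `wrap_list` is a hypothetical stand-in for `PDFList.load()` and `wrap_dict_lists` is a hypothetical helper; the real `parse_object()` does more than this.

```python
import logging

log = logging.getLogger(__name__)


def wrap_list(items):
    """Hypothetical stand-in for PDFList.load(); rejects lists containing None."""
    if any(i is None for i in items):
        raise ValueError("malformed list")
    return tuple(items)


def wrap_dict_lists(d):
    """Wrap list values in place, skipping empty or malformed ones."""
    for key, value in list(d.items()):
        # Type check first: a bare `if not value` here would also skip
        # valid falsy scalars such as 0, False, or "".
        if not isinstance(value, list):
            continue
        if not value:
            log.debug("skipping empty list at key %r", key)
            continue
        try:
            # Write the wrapped list back so the dictionary stays consistent.
            d[key] = wrap_list(value)
        except ValueError as exc:
            # Warn and keep going so the rest of the dictionary is still parsed.
            log.warning("malformed list at key %r: %s", key, exc)
    return d
```

With an input such as `{"Count": 0, "Kids": [1, 2], "Empty": []}`, the `Count` entry is left untouched, `Kids` is wrapped, and `Empty` is skipped with a debug message.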

Testing:

- Added comprehensive unit tests in tests/test_pdf.py covering:
  - Empty lists
  - Lists with/without offset information
  - Mixed offset scenarios
  - Empty lists in dictionaries
  - Malformed list values
  - Unexpected dictionary values
  - Preservation of falsy but valid values (0, False)
- All 8 new tests pass
- All 23 existing unit tests continue to pass
- PDF parsing verified with testdata/javascript.pdf

This builds on the approach from PR #3426 by @mrscottyrose, correcting the logic ordering and offset calculation and adding comprehensive test coverage.

Fixes #12

cc: @smoelius

🤖 Generated with Claude Code


pbottine requested a review from ESultanik as a code owner November 26, 2025 20:29
