Defend against single-page TOC misclassification dropping all content #267
Open
SuperMarioYL wants to merge 1 commit into VectifyAI:main
When a single-page document is passed to PageIndex and toc_detector_single_page returns "yes" (a known failure mode for pages of numbered policy / rule / statute content), find_toc_pages returns [0]. check_toc then enters the process_toc_with_page_numbers path with start_page_index = toc_page_list[-1] + 1 = 1, so the for-loop in process_toc_with_page_numbers iterates over range(1, min(1+toc_check_page_num, 1)) = range(1, 1), which is empty, yielding main_content="" and silently dropping the entire document.
Fix: in check_toc, after find_toc_pages, treat "the TOC covers the whole document" as a misclassification and fall through to the no-toc path so the page itself becomes content. A real TOC always points to content located elsewhere; if there is no content elsewhere, the detection was wrong by definition.
Also tightened the toc_detector_single_page prompt to make the content-vs-directory distinction explicit, with examples of structured content (policies, regulations, rules, statutes, contracts, ordinances) that resemble TOCs but are not.
Refs VectifyAI#203.
Problem
When a single-page document (or any short doc where the TOC detector misfires on every page it inspects) is run through PageIndex, the entire document content is silently dropped.
Reproduction trace through the current code on a 1-page PDF:
1. tree_parser calls check_toc(page_list, opt).
2. check_toc calls find_toc_pages(start_page_index=0, page_list, opt).
3. find_toc_pages runs once with i=0. If toc_detector_single_page returns "yes" (a real failure mode on pages of numbered policy / rule / statute content), toc_page_list becomes [0] and the loop exits because i=1 == len(page_list).
4. Back in check_toc, the else branch runs and returns {toc_page_list: [0], page_index_given_in_toc: 'yes', toc_content: <fake-extracted>}.
5. tree_parser enters the process_toc_with_page_numbers path.
6. Inside process_toc_with_page_numbers, the page loop starts at index 1 and has nothing to iterate over: main_content stays "", all downstream extractors run on empty input, and the document's only page is discarded.
This matches the data-loss behavior reported in #203 (single-page policy / regulation / memo input → zero content in output).
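The empty-loop arithmetic in the trace above can be checked in isolation. The snippet below is a minimal sketch, not the PageIndex source: collect_main_content is a hypothetical stand-in for the page loop, pages are modelled as plain strings, the value 20 for toc_check_page_num is an arbitrary placeholder, and the range bounds are assumed to follow the expression quoted in the commit message.

```python
def collect_main_content(page_list, toc_page_list, toc_check_page_num=20):
    """Hypothetical stand-in for the page loop described in the trace."""
    # Content is assumed to start on the page right after the last TOC page.
    start_page_index = toc_page_list[-1] + 1
    # Assumed upper bound: never read past the end of the document.
    end_page_index = min(start_page_index + toc_check_page_num, len(page_list))
    main_content = ""
    for page_index in range(start_page_index, end_page_index):
        main_content += page_list[page_index]
    return main_content


single_page = ["1. Purpose\nThis policy applies to all staff."]
# The only page was misclassified as a TOC, so toc_page_list == [0].
dropped = collect_main_content(single_page, toc_page_list=[0])
print(repr(dropped))  # prints '' because range(1, 1) is empty
```

With a second page present, the same function returns that page's text, which is why the loss only shows up when the detected TOC reaches the end of the input.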
Fix
A real table of contents always points to content located beyond it. So if find_toc_pages returns a list whose last index is at or past the document's last page, the detection must be wrong: by construction there is nothing for the TOC to point at. In that degenerate case check_toc now falls back to returning the no-toc verdict, which routes the document through process_no_toc and preserves the page as content.
This is a defensive guard with the smallest possible blast radius; it does not change behavior for any document where the TOC pages don't span the whole input.
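As a rough illustration of the guard described here, the sketch below models only the new condition. It is not the code from this PR: check_toc_sketch and toc_covers_whole_document are hypothetical names, the detector is passed in as a stub, and the exact shape of the no-toc verdict is an assumption based on the keys shown in the reproduction trace.

```python
def toc_covers_whole_document(toc_page_list, page_count):
    """Return True when no page is left after the last detected TOC page."""
    return bool(toc_page_list) and toc_page_list[-1] >= page_count - 1


def check_toc_sketch(page_list, find_toc_pages):
    """Hypothetical, simplified check_toc: only the guard is modelled."""
    toc_page_list = find_toc_pages(page_list)
    if not toc_page_list or toc_covers_whole_document(toc_page_list, len(page_list)):
        # Assumed no-toc verdict: the caller treats every page as content.
        return {"toc_content": None, "toc_page_list": [],
                "page_index_given_in_toc": "no"}
    return {"toc_content": "<extracted>", "toc_page_list": toc_page_list,
            "page_index_given_in_toc": "yes"}


# Stub detector that flags page 0 as a TOC (the failure mode).
one_page = check_toc_sketch(["page 0"], lambda pages: [0])
two_pages = check_toc_sketch(["toc", "body"], lambda pages: [0])
print(one_page["toc_page_list"], two_pages["toc_page_list"])  # prints [] [0]
```

The two-page case is left unchanged by the guard, which matches the claim that documents with content after the TOC behave as before.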
I also tightened the toc_detector_single_page prompt per the issue's request: it now explains that a TOC is a directory of references to content located elsewhere, and that pages whose numbered headings are followed on the same page by their own substantive body text are content, not a TOC. The typical false-positive categories (policies, regulations, rules, statutes, contracts, ordinances, articles) are named explicitly.
Verification
The repo currently has no test infrastructure, so this is verified by code-trace reasoning (above) plus a self-contained reproduction script that inlines both check_toc versions and stubs the LLM detector to always return "yes" (the failure mode):
repro_203.py
Output:
I haven't added a tests/ directory because the repo doesn't have one and I didn't want to introduce a new dependency or convention that wasn't already there. Happy to wire in pytest in a follow-up if that's useful.
Notes
The change is confined to pageindex/page_index.py. No other files touched.
Fixes #203.