Defend against single-page TOC misclassification dropping all content#267

Open
SuperMarioYL wants to merge 1 commit into VectifyAI:main from SuperMarioYL:fix/single-page-toc-misclassification

Conversation

@SuperMarioYL

Problem

When a single-page document (or any short doc where the TOC detector misfires on every page it inspects) is run through PageIndex, the entire document content is silently dropped.

Reproduction trace through the current code on a 1-page PDF:

  1. tree_parser calls check_toc(page_list, opt).

  2. check_toc calls find_toc_pages(0, page_list, opt), i.e. with start_page_index = 0.

  3. find_toc_pages runs once with i=0. If toc_detector_single_page returns "yes" — a real failure mode on pages of numbered policy / rule / statute content — toc_page_list becomes [0] and the loop exits because i=1 == len(page_list).

  4. Back in check_toc, the else branch runs and returns {toc_page_list: [0], page_index_given_in_toc: 'yes', toc_content: <fake-extracted>}.

  5. tree_parser enters the process_toc_with_page_numbers path. Inside process_toc_with_page_numbers:

    start_page_index = toc_page_list[-1] + 1   # = 1
    main_content = ""
    for page_index in range(start_page_index,
                            min(start_page_index + toc_check_page_num,
                                len(page_list))):     # range(1, 1) → empty
        main_content += ...

    main_content stays "", all downstream extractors run on empty input, and the document's only page is discarded.

This matches the data-loss behavior reported in #203 (single-page policy / regulation / memo input → zero content in output).
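The arithmetic in step 5 can be checked in isolation; a minimal sketch, with the identifier names taken from the trace above and the default window size assumed to be 8:

```python
# Reproduce the empty-range arithmetic from step 5 for a 1-page document.
toc_page_list = [0]          # misclassified: the document's only page flagged as TOC
page_list = ["page 0 text"]  # one page of real content
toc_check_page_num = 8       # window size assumed for illustration

start_page_index = toc_page_list[-1] + 1  # = 1
pages = range(start_page_index,
              min(start_page_index + toc_check_page_num, len(page_list)))

print(list(pages))  # [] -- no page is ever read, so main_content stays ""
```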

Fix

A real table of contents always points to content located beyond it. So if find_toc_pages returns a list whose last index is at or past the end of the document, the detection must be wrong — by construction there is nothing for the TOC to point at. In that degenerate case check_toc now falls back to returning the no-toc verdict, which routes the document through process_no_toc and preserves the page as content.

if toc_page_list[-1] + 1 >= len(page_list):
    print('toc covers the entire document (likely misclassification); '
          'falling back to no-toc')
    return {'toc_content': None, 'toc_page_list': [],
            'page_index_given_in_toc': 'no'}

This is a defensive guard at the smallest blast radius; it does not change behavior for any document where the TOC pages don't span the whole input.
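To make the blast-radius claim concrete, the guard condition only fires when the detected TOC pages reach the end of the document (the page lists below are illustrative, not from the repo):

```python
def toc_spans_whole_doc(toc_page_list, page_list):
    """The new guard: True means the detected 'TOC' leaves nothing to point at."""
    return toc_page_list[-1] + 1 >= len(page_list)

# Degenerate case from the bug: 1-page doc, page 0 flagged as TOC.
print(toc_spans_whole_doc([0], ["p0"]))         # True  -> fall back to no-toc

# Normal case: 10-page doc with a TOC on pages 0-1.
print(toc_spans_whole_doc([0, 1], ["p"] * 10))  # False -> behavior unchanged
```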

I also tightened the toc_detector_single_page prompt per the issue's request: it now explains that a TOC is a directory of references to content located elsewhere, and that pages whose numbered headings are followed on the same page by their own substantive body text are content, not a TOC. The list of typical false-positive categories (policies, regulations, rules, statutes, contracts, ordinances, articles) is named explicitly.
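As a rough illustration of the content-vs-directory distinction the prompt now draws (this wording is a sketch, not the exact prompt text in the PR):

```python
# Hypothetical excerpt capturing the distinction the tightened prompt spells out.
TOC_VS_CONTENT_HINT = (
    "A table of contents is a directory: a list of references to content "
    "located elsewhere in the document. A page whose numbered headings are "
    "each followed on the same page by their own substantive body text is "
    "content, not a table of contents. Typical false positives include "
    "policies, regulations, rules, statutes, contracts, ordinances, and "
    "articles."
)
print("directory" in TOC_VS_CONTENT_HINT)
```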

Verification

The repo currently has no test infrastructure, so this is verified by code-trace reasoning (above) plus a self-contained reproduction script that inlines both check_toc versions and stubs the LLM detector to always return "yes" (the failure mode):

repro_203.py:
def find_toc_pages_stub(start_page_index, page_list, opt, logger=None):
    last_page_is_yes, toc_page_list, i = False, [], start_page_index
    while i < len(page_list):
        if i >= opt.toc_check_page_num and not last_page_is_yes:
            break
        detected_result = 'yes'  # simulate the failure mode in #203
        if detected_result == 'yes':
            toc_page_list.append(i); last_page_is_yes = True
        elif detected_result == 'no' and last_page_is_yes:
            break
        i += 1
    return toc_page_list

def toc_extractor_stub(page_list, toc_page_list, model):
    return {'toc_content': '1. Article 1\n2. Article 2',
            'page_index_given_in_toc': 'yes'}

def check_toc_legacy(page_list, opt=None):
    toc_page_list = find_toc_pages_stub(0, page_list, opt)
    if len(toc_page_list) == 0:
        return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}
    toc_json = toc_extractor_stub(page_list, toc_page_list, opt.model)
    return {'toc_content': toc_json['toc_content'],
            'toc_page_list': toc_page_list,
            'page_index_given_in_toc': toc_json['page_index_given_in_toc']}

def check_toc_patched(page_list, opt=None):
    toc_page_list = find_toc_pages_stub(0, page_list, opt)
    if len(toc_page_list) == 0:
        return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}
    if toc_page_list[-1] + 1 >= len(page_list):    # the new guard
        return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}
    toc_json = toc_extractor_stub(page_list, toc_page_list, opt.model)
    return {'toc_content': toc_json['toc_content'],
            'toc_page_list': toc_page_list,
            'page_index_given_in_toc': toc_json['page_index_given_in_toc']}

class _Opt:
    toc_check_page_num = 8
    model = 'stub'

def simulate_main_content(result, page_list, opt):
    tpl = result['toc_page_list']
    if not tpl: return None
    start = tpl[-1] + 1
    return "".join(f"<physical_index_{i+1}>\n{page_list[i][0]}\n<physical_index_{i+1}>\n\n"
                   for i in range(start, min(start + opt.toc_check_page_num, len(page_list))))

page_list = [("Article 1. Hours of operation.\n"
              "The office shall be open 9am-5pm.\n"
              "Article 2. Holidays.\n"
              "The office shall be closed on federal holidays.\n", {})]

for label, fn in [("LEGACY", check_toc_legacy), ("PATCHED", check_toc_patched)]:
    print(f"=== {label} ===")
    r = fn(page_list, _Opt())
    print(f"check_toc returned: {r}")
    mc = simulate_main_content(r, page_list, _Opt())
    if mc is None:
        print("-> falls through to no-toc, process_no_toc receives full page as content")
    else:
        print(f"-> process_toc_with_page_numbers, main_content: {len(mc)} chars"
              + ("    [BUG: document dropped]" if len(mc) == 0 else ""))
    print()

Output:

=== LEGACY ===
check_toc returned: {'toc_content': '1. Article 1\n2. Article 2', 'toc_page_list': [0], 'page_index_given_in_toc': 'yes'}
-> process_toc_with_page_numbers, main_content: 0 chars    [BUG: document dropped]

=== PATCHED ===
check_toc returned: {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}
-> falls through to no-toc, process_no_toc receives full page as content

I haven't added a tests/ directory because the repo doesn't have one and I didn't want to introduce a new dependency / convention that wasn't already there — happy to wire in pytest in a follow-up if that's useful.

Notes

  • Net diff: +33 / −8 in pageindex/page_index.py. No other files touched.
  • Compatible with the in-flight comma fix in fix: add missing commas in LLM prompts to ensure valid JSON output #258 — that PR adds a trailing comma at line 112; this PR rewrites the body of the prompt above that line and (intentionally) leaves the JSON-format spec untouched, so a merge should be straightforward in either order.

Fixes #203.

When a single-page document is passed to PageIndex and toc_detector_single_page
returns "yes" (a known failure mode for pages of numbered policy / rule /
statute content), find_toc_pages returns [0]. check_toc reports a TOC, so
tree_parser enters the process_toc_with_page_numbers path with
start_page_index = toc_page_list[-1] + 1 = 1, and the page loop iterates over
range(1, min(1 + toc_check_page_num, 1)) = range(1, 1), an empty range,
yielding main_content = "" and silently dropping the entire document.

Fix: in check_toc, after find_toc_pages, treat "the TOC covers the whole
document" as a misclassification and fall through to the no-toc path so
the page itself becomes content. A real TOC always points to content
located elsewhere; if there is no content elsewhere, the detection was
wrong by definition.

Also tightened the toc_detector_single_page prompt to make the
content-vs-directory distinction explicit, with examples of structured
content (policies, regulations, rules, statutes, contracts, ordinances)
that resemble TOCs but are not.

Refs VectifyAI#203.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Single-page documents incorrectly identified as TOC, skipping all content