get_page_content over-collects on Markdown when given a comma-separated page list #279

@akhilesharora

get_page_content claims to accept three forms of page spec: a range '5-7', a comma-separated list '3,8', or a single page '12' (see docstring at pageindex/retrieve.py:111-119). For PDFs it works as advertised. For Markdown docs, however, '3,8' is treated as the inclusive range [3,8] and pulls in every heading whose line_num falls between the two values.
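To make the expected semantics concrete, here is a minimal sketch of how the three forms would expand into discrete page numbers. `parse_pages_sketch` is a hypothetical stand-in written for this issue, not the library's `_parse_pages`, and its error handling is an assumption.

```python
from typing import List


def parse_pages_sketch(spec: str) -> List[int]:
    """Hypothetical parser for '5-7', '3,8', or '12' (not the library's code)."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as '3,,8'
        if "-" in part:
            # A range is inclusive on both ends: '5-7' means 5, 6, 7.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            # A bare number selects exactly that page.
            pages.add(int(part))
    return sorted(pages)


print(parse_pages_sketch("5-7"))  # [5, 6, 7]
print(parse_pages_sketch("3,8"))  # [3, 8]
print(parse_pages_sketch("12"))   # [12]
```

Under this reading, '3,8' names two pages only; nothing between them is requested.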

Repro, no LLM, no real PDF:

import json
from pageindex.retrieve import get_page_content

# Two in-memory documents that expose the same numbers: 5, 10, 50, 100.
docs = {
  # Markdown: headings are addressed by line_num.
  'md':  {'type':'md','structure':[
    {'line_num':5,'text':'L5','nodes':[]},
    {'line_num':10,'text':'L10','nodes':[]},
    {'line_num':50,'text':'L50','nodes':[]},
    {'line_num':100,'text':'L100','nodes':[]}]},
  # PDF: pages are addressed by page.
  'pdf': {'type':'pdf','pages':[
    {'page':5,'content':'P5'},{'page':10,'content':'P10'},
    {'page':50,'content':'P50'},{'page':100,'content':'P100'}]}}

# Ask for the discrete list '5,100' from each document.
print(json.loads(get_page_content(docs, 'md',  '5,100')))
print(json.loads(get_page_content(docs, 'pdf', '5,100')))

Got:

  • md -> pages 5, 10, 50, 100
  • pdf -> pages 5, 100

Want both to return [5, 100] like the docstring suggests.

The over-collection happens in _get_md_page_content at pageindex/retrieve.py:56-76, which takes min(page_nums)/max(page_nums) and matches everything in that window. _parse_pages already returns a discrete sorted list, so the discreteness is lost entirely in the Markdown helper.
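A rough sketch of the difference, and of one possible fix, assuming the helper walks the heading tree and compares each node's line_num against the parsed page numbers. `flatten`, `match_window`, and `match_discrete` are hypothetical names for illustration; the real helper's internals are known here only from the min/max description above.

```python
def flatten(nodes):
    """Yield every node of a nested heading tree, depth first."""
    for node in nodes:
        yield node
        yield from flatten(node.get("nodes", []))


def match_window(structure, page_nums):
    """Behaviour described above: match everything between min and max."""
    lo, hi = min(page_nums), max(page_nums)
    return [n["line_num"] for n in flatten(structure) if lo <= n["line_num"] <= hi]


def match_discrete(structure, page_nums):
    """Possible fix: match only the line numbers that were actually requested."""
    wanted = set(page_nums)
    return [n["line_num"] for n in flatten(structure) if n["line_num"] in wanted]


structure = [
    {"line_num": 5, "text": "L5", "nodes": []},
    {"line_num": 10, "text": "L10", "nodes": []},
    {"line_num": 50, "text": "L50", "nodes": []},
    {"line_num": 100, "text": "L100", "nodes": []},
]

print(match_window(structure, [5, 100]))    # [5, 10, 50, 100]
print(match_discrete(structure, [5, 100]))  # [5, 100]
```

Set membership should leave ranges unaffected: if '5-7' has already been expanded to [5, 6, 7], both functions select the same headings.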

Noticed this while looking at how the agentic demo (examples/agentic_vectorless_rag_demo.py) calls get_page_content. On long Markdown docs a comma-list quietly pulls in unrelated sections, which inflates token use.
