Skip to content

Fix direct image extraction dispatch#1996

Open
jioffe502 wants to merge 4 commits into
NVIDIA:mainfrom
jioffe502:investigate/image_extraction_bug
Open

Fix direct image extraction dispatch#1996
jioffe502 wants to merge 4 commits into
NVIDIA:mainfrom
jioffe502:investigate/image_extraction_bug

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

@jioffe502 jioffe502 commented May 7, 2026

Summary

  • Default GraphIngestor.extract() to extraction_mode="auto" so supported non-PDF inputs are routed through MultiTypeExtractOperator by extension.
  • Preserve extraction_mode="pdf" as the explicit dedicated PDF/document graph path.
  • Add coverage for BMP/TIFF/TIF page image materialization and mixed PDF+BMP auto dispatch.
  • Add graph_ingestor.md documenting GraphIngestor usage, run modes, extraction modes, supported auto-dispatch groups, split configuration, and post-extraction stages.

Testing

  • python -m pytest nemo_retriever/tests/test_ingest_interface.py -q
  • Full JP20 pipeline smoke-tested with the auto dispatch path.

@jioffe502 jioffe502 requested review from a team as code owners May 7, 2026 19:00
@jioffe502 jioffe502 requested a review from jperez999 May 7, 2026 19:00
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR changes GraphIngestor.extract() to default extraction_mode=\"auto\" (previously \"pdf\") so that non-PDF file types such as BMP and TIFF are automatically routed through MultiTypeExtractOperator by extension, while extraction_mode=\"pdf\" remains available as an explicit override. A new graph_ingestor.md reference document and two new tests covering direct image materialization and mixed-corpus dispatch are also included.

  • graph_ingestor.py: Single default-value change on extract() with an updated docstring and a minor comment clarification in _build_graph.
  • test_ingest_interface.py: Updates the existing test_extract_unified_defaults assertion and adds test_extract_default_materializes_direct_image_page_image (BMP/TIFF parametrized) and test_extract_default_auto_dispatches_mixed_supported_inputs.
  • graph_ingestor.md: New reference doc covering run modes, extraction modes, ExtractParams fields, split_config, and post-extraction stages.

Confidence Score: 5/5

The change is mechanically small — one default-value swap, a docstring update, and two new tests — and the documentation is thorough about the behavioral difference between 'auto' and 'pdf' modes.

The core logic change is confined to a single parameter default; the auto-dispatch path through MultiTypeExtractOperator existed before this PR and is not being restructured. The new tests exercise BMP/TIFF materialization and mixed-corpus routing with the image branch fully verified. No new unsafe patterns or data-loss paths were introduced by this diff.

Both test functions in test_ingest_interface.py deserve a closer look — the PDF row assertions in the mixed-dispatch test are thin, and the monkeypatching targets private methods two levels below the public boundary.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/graph_ingestor.py Default extraction_mode on extract() changed from "pdf" to "auto"; docstring and one comment updated — no other logic changes.
nemo_retriever/tests/test_ingest_interface.py Updated default-mode assertion and two new tests; both new tests monkeypatch private class methods two levels below the public boundary without spec-checking.
nemo_retriever/src/nemo_retriever/graph_ingestor.md New reference document; well-structured, accurate descriptions of run modes, extraction modes, dispatch groups, and post-extraction stages.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["GraphIngestor.extract()\nextraction_mode='auto' (NEW default)"] --> B{MultiTypeExtractOperator\ndispatch by extension}
    B -->|".pdf / .docx / .pptx"| C["DocToPdfConversion → PDFSplit\n→ PDF Extraction → Detection\n→ Optional OCR/Table/Chart"]
    B -->|".png / .jpg / .bmp / .tiff / .tif"| D["ImageLoadActor → Detection\n→ Optional OCR/Table/Chart"]
    B -->|".txt"| E["TxtSplitActor"]
    B -->|".html"| F["HtmlSplitActor"]
    B -->|".mp3 / .wav"| G["MediaChunkActor → ASRActor"]
    B -->|".mp4 / .mov / .mkv"| H["Video Frame Extraction\n+ Optional ASR + AV Fusion"]
    A2["GraphIngestor.extract(extraction_mode='pdf')\n(explicit override)"] --> C2["PDF/document graph\ndirectly — no extension dispatch"]
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/tests/test_ingest_interface.py:165-168
**PDF branch goes unverified in the mixed-dispatch test**

The test confirms both file paths appear in the result and checks the image row thoroughly, but never asserts anything about the PDF row's content (e.g. `text`, `metadata["error"]`, or `page_image` being `None`). Because `_run_pdf_pipeline` is mocked to return a hand-crafted DataFrame, a regression in how `MultiTypeExtractOperator` merges or passes through the PDF branch output would leave these fields silently wrong while this test still passes. Adding at minimum `pdf_row["text"] == "pdf text"` and `pdf_row["metadata"]["error"] is None` would close that gap.

Reviews (4): Last reviewed commit: "Merge branch 'main' into investigate/ima..." | Re-trigger Greptile

*,
split_config: dict[str, Any] | None = None,
extraction_mode: str = "pdf",
extraction_mode: str = "auto",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Changing the default from "pdf" to "auto" silently changes behavior for every existing caller who invokes .extract() without an explicit extraction_mode. Prior to this PR those callers were routed through the legacy PDF/document graph; after this PR they will use MultiTypeExtractOperator. Because "auto" and "pdf" follow different processing paths for PDF inputs (as the new docstring acknowledges), this is a breaking API change that should go through a deprecation cycle rather than a silent default swap. Without a cycle, any pipeline that relied on the PDF graph's behavior will produce different output without a clear signal.

Suggested change
extraction_mode: str = "auto",
extraction_mode: str = "auto", # TODO(deprecation): warn callers who relied on the old "pdf" default

Rule Used: Changes to public API surfaces (FastAPI endpoints,... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 184

Comment:
Changing the default from `"pdf"` to `"auto"` silently changes behavior for every existing caller who invokes `.extract()` without an explicit `extraction_mode`. Prior to this PR those callers were routed through the legacy PDF/document graph; after this PR they will use `MultiTypeExtractOperator`. Because `"auto"` and `"pdf"` follow different processing paths for PDF inputs (as the new docstring acknowledges), this is a breaking API change that should go through a deprecation cycle rather than a silent default swap. Without a cycle, any pipeline that relied on the PDF graph's behavior will produce different output without a clear signal.

```suggestion
        extraction_mode: str = "auto",  # TODO(deprecation): warn callers who relied on the old "pdf" default
```

**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +92 to +95
monkeypatch.setattr(
"nemo_retriever.graph.multi_type_extract_operator._MultiTypeExtractBase._run_detection_pipeline",
lambda self, batch_df: batch_df,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Mocking too deep in the stack — fragile tests ahead. Both new tests patch _MultiTypeExtractBase._run_detection_pipeline and _run_pdf_pipeline, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.

Rule Used: Tests must mock all external service dependencies ... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_ingest_interface.py
Line: 92-95

Comment:
Mocking too deep in the stack — fragile tests ahead. Both new tests patch `_MultiTypeExtractBase._run_detection_pipeline` and `_run_pdf_pipeline`, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.

**Rule Used:** Tests must mock all external service dependencies ... ([source](https://app.greptile.com/review/custom-context?memory=mock-external-services))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants