Fix direct image extraction dispatch by jioffe502 · Pull Request #1996 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-05-07T19:00:45Z

Summary

Default GraphIngestor.extract() to extraction_mode="auto" so supported non-PDF inputs are routed through MultiTypeExtractOperator by extension.
Preserve extraction_mode="pdf" as the explicit dedicated PDF/document graph path.
Add coverage for BMP/TIFF/TIF page image materialization and mixed PDF+BMP auto dispatch.
Add graph_ingestor.md documenting GraphIngestor usage, run modes, extraction modes, supported auto-dispatch groups, split configuration, and post-extraction stages.

Testing

python -m pytest nemo_retriever/tests/test_ingest_interface.py -q
Full JP20 pipeline smoke-tested with the auto dispatch path.

greptile-apps · 2026-05-07T19:11:55Z

Greptile Summary

This PR changes GraphIngestor.extract() to default extraction_mode=\"auto\" (previously \"pdf\") so that non-PDF file types such as BMP and TIFF are automatically routed through MultiTypeExtractOperator by extension, while extraction_mode=\"pdf\" remains available as an explicit override. A new graph_ingestor.md reference document and two new tests covering direct image materialization and mixed-corpus dispatch are also included.

graph_ingestor.py: Single default-value change on extract() with an updated docstring and a minor comment clarification in _build_graph.
test_ingest_interface.py: Updates the existing test_extract_unified_defaults assertion and adds test_extract_default_materializes_direct_image_page_image (BMP/TIFF parametrized) and test_extract_default_auto_dispatches_mixed_supported_inputs.
graph_ingestor.md: New reference doc covering run modes, extraction modes, ExtractParams fields, split_config, and post-extraction stages.

Confidence Score: 5/5

The change is mechanically small — one default-value swap, a docstring update, and two new tests — and the documentation is thorough about the behavioral difference between 'auto' and 'pdf' modes.

The core logic change is confined to a single parameter default; the auto-dispatch path through MultiTypeExtractOperator existed before this PR and is not being restructured. The new tests exercise BMP/TIFF materialization and mixed-corpus routing with the image branch fully verified. No new unsafe patterns or data-loss paths were introduced by this diff.

Both test functions in test_ingest_interface.py deserve a closer look — the PDF row assertions in the mixed-dispatch test are thin, and the monkeypatching targets private methods two levels below the public boundary.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/graph_ingestor.py	Default `extraction_mode` on `extract()` changed from `"pdf"` to `"auto"`; docstring and one comment updated — no other logic changes.
nemo_retriever/tests/test_ingest_interface.py	Updated default-mode assertion and two new tests; both new tests monkeypatch private class methods two levels below the public boundary without spec-checking.
nemo_retriever/src/nemo_retriever/graph_ingestor.md	New reference document; well-structured, accurate descriptions of run modes, extraction modes, dispatch groups, and post-extraction stages.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["GraphIngestor.extract()\nextraction_mode='auto' (NEW default)"] --> B{MultiTypeExtractOperator\ndispatch by extension}
    B -->|".pdf / .docx / .pptx"| C["DocToPdfConversion → PDFSplit\n→ PDF Extraction → Detection\n→ Optional OCR/Table/Chart"]
    B -->|".png / .jpg / .bmp / .tiff / .tif"| D["ImageLoadActor → Detection\n→ Optional OCR/Table/Chart"]
    B -->|".txt"| E["TxtSplitActor"]
    B -->|".html"| F["HtmlSplitActor"]
    B -->|".mp3 / .wav"| G["MediaChunkActor → ASRActor"]
    B -->|".mp4 / .mov / .mkv"| H["Video Frame Extraction\n+ Optional ASR + AV Fusion"]
    A2["GraphIngestor.extract(extraction_mode='pdf')\n(explicit override)"] --> C2["PDF/document graph\ndirectly — no extension dispatch"]

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/tests/test_ingest_interface.py:165-168
**PDF branch goes unverified in the mixed-dispatch test**

The test confirms both file paths appear in the result and checks the image row thoroughly, but never asserts anything about the PDF row's content (e.g. `text`, `metadata["error"]`, or `page_image` being `None`). Because `_run_pdf_pipeline` is mocked to return a hand-crafted DataFrame, a regression in how `MultiTypeExtractOperator` merges or passes through the PDF branch output would leave these fields silently wrong while this test still passes. Adding at minimum `pdf_row["text"] == "pdf text"` and `pdf_row["metadata"]["error"] is None` would close that gap.

_{Reviews (4): Last reviewed commit: "Merge branch 'main' into investigate/ima..." | Re-trigger Greptile}

greptile-apps · 2026-05-07T19:11:58Z

        *,
        split_config: dict[str, Any] | None = None,
-        extraction_mode: str = "pdf",
+        extraction_mode: str = "auto",


Changing the default from "pdf" to "auto" silently changes behavior for every existing caller who invokes .extract() without an explicit extraction_mode. Prior to this PR those callers were routed through the legacy PDF/document graph; after this PR they will use MultiTypeExtractOperator. Because "auto" and "pdf" follow different processing paths for PDF inputs (as the new docstring acknowledges), this is a breaking API change that should go through a deprecation cycle rather than a silent default swap. Without a cycle, any pipeline that relied on the PDF graph's behavior will produce different output without a clear signal.

Suggested change

extraction_mode: str = "auto",

extraction_mode: str = "auto", # TODO(deprecation): warn callers who relied on the old "pdf" default

Rule Used: Changes to public API surfaces (FastAPI endpoints,... (source)

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py Line: 184 Comment: Changing the default from `"pdf"` to `"auto"` silently changes behavior for every existing caller who invokes `.extract()` without an explicit `extraction_mode`. Prior to this PR those callers were routed through the legacy PDF/document graph; after this PR they will use `MultiTypeExtractOperator`. Because `"auto"` and `"pdf"` follow different processing paths for PDF inputs (as the new docstring acknowledges), this is a breaking API change that should go through a deprecation cycle rather than a silent default swap. Without a cycle, any pipeline that relied on the PDF graph's behavior will produce different output without a clear signal. ```suggestion extraction_mode: str = "auto", # TODO(deprecation): warn callers who relied on the old "pdf" default ``` **Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility)) How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-07T19:12:03Z

+    monkeypatch.setattr(
+        "nemo_retriever.graph.multi_type_extract_operator._MultiTypeExtractBase._run_detection_pipeline",
+        lambda self, batch_df: batch_df,
+    )


Mocking too deep in the stack — fragile tests ahead. Both new tests patch _MultiTypeExtractBase._run_detection_pipeline and _run_pdf_pipeline, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.

Rule Used: Tests must mock all external service dependencies ... (source)

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/tests/test_ingest_interface.py Line: 92-95 Comment: Mocking too deep in the stack — fragile tests ahead. Both new tests patch `_MultiTypeExtractBase._run_detection_pipeline` and `_run_pdf_pipeline`, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed. **Rule Used:** Tests must mock all external service dependencies ... ([source](https://app.greptile.com/review/custom-context?memory=mock-external-services)) How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix direct image extraction dispatch

a825b75

jioffe502 requested review from a team as code owners May 7, 2026 19:00

jioffe502 requested a review from jperez999 May 7, 2026 19:00

greptile-apps Bot reviewed May 7, 2026

View reviewed changes

jioffe502 and others added 3 commits May 7, 2026 20:57

Document GraphIngestor extraction modes

e6148f5

Clarify PDF extraction mode wording

aa81017

Merge branch 'main' into investigate/image_extraction_bug

e4381bd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix direct image extraction dispatch#1996

Fix direct image extraction dispatch#1996
jioffe502 wants to merge 4 commits into
NVIDIA:mainfrom
jioffe502:investigate/image_extraction_bug

jioffe502 commented May 7, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 7, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	extraction_mode: str = "auto",
	extraction_mode: str = "auto", # TODO(deprecation): warn callers who relied on the old "pdf" default

Conversation

jioffe502 commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

greptile-apps Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jioffe502 commented May 7, 2026 •

edited

Loading

greptile-apps Bot commented May 7, 2026 •

edited

Loading