Fix direct image extraction dispatch#1996
Conversation
Greptile SummaryThis PR changes
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Default extraction_mode on extract() changed from "pdf" to "auto"; docstring and one comment updated — no other logic changes. |
| nemo_retriever/tests/test_ingest_interface.py | Updated default-mode assertion and two new tests; both new tests monkeypatch private class methods two levels below the public boundary without spec-checking. |
| nemo_retriever/src/nemo_retriever/graph_ingestor.md | New reference document; well-structured, accurate descriptions of run modes, extraction modes, dispatch groups, and post-extraction stages. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["GraphIngestor.extract()\nextraction_mode='auto' (NEW default)"] --> B{MultiTypeExtractOperator\ndispatch by extension}
B -->|".pdf / .docx / .pptx"| C["DocToPdfConversion → PDFSplit\n→ PDF Extraction → Detection\n→ Optional OCR/Table/Chart"]
B -->|".png / .jpg / .bmp / .tiff / .tif"| D["ImageLoadActor → Detection\n→ Optional OCR/Table/Chart"]
B -->|".txt"| E["TxtSplitActor"]
B -->|".html"| F["HtmlSplitActor"]
B -->|".mp3 / .wav"| G["MediaChunkActor → ASRActor"]
B -->|".mp4 / .mov / .mkv"| H["Video Frame Extraction\n+ Optional ASR + AV Fusion"]
A2["GraphIngestor.extract(extraction_mode='pdf')\n(explicit override)"] --> C2["PDF/document graph\ndirectly — no extension dispatch"]
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemo_retriever/tests/test_ingest_interface.py:165-168
**PDF branch goes unverified in the mixed-dispatch test**
The test confirms both file paths appear in the result and checks the image row thoroughly, but never asserts anything about the PDF row's content (e.g. `text`, `metadata["error"]`, or `page_image` being `None`). Because `_run_pdf_pipeline` is mocked to return a hand-crafted DataFrame, a regression in how `MultiTypeExtractOperator` merges or passes through the PDF branch output would leave these fields silently wrong while this test still passes. Adding at minimum `pdf_row["text"] == "pdf text"` and `pdf_row["metadata"]["error"] is None` would close that gap.
Reviews (4): Last reviewed commit: "Merge branch 'main' into investigate/ima..." | Re-trigger Greptile
| *, | ||
| split_config: dict[str, Any] | None = None, | ||
| extraction_mode: str = "pdf", | ||
| extraction_mode: str = "auto", |
There was a problem hiding this comment.
Changing the default from
"pdf" to "auto" silently changes behavior for every existing caller who invokes .extract() without an explicit extraction_mode. Prior to this PR those callers were routed through the legacy PDF/document graph; after this PR they will use MultiTypeExtractOperator. Because "auto" and "pdf" follow different processing paths for PDF inputs (as the new docstring acknowledges), this is a breaking API change that should go through a deprecation cycle rather than a silent default swap. Without a cycle, any pipeline that relied on the PDF graph's behavior will produce different output without a clear signal.
| extraction_mode: str = "auto", | |
| extraction_mode: str = "auto", # TODO(deprecation): warn callers who relied on the old "pdf" default |
Rule Used: Changes to public API surfaces (FastAPI endpoints,... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 184
Comment:
Changing the default from `"pdf"` to `"auto"` silently changes behavior for every existing caller who invokes `.extract()` without an explicit `extraction_mode`. Prior to this PR those callers were routed through the legacy PDF/document graph; after this PR they will use `MultiTypeExtractOperator`. Because `"auto"` and `"pdf"` follow different processing paths for PDF inputs (as the new docstring acknowledges), this is a breaking API change that should go through a deprecation cycle rather than a silent default swap. Without a cycle, any pipeline that relied on the PDF graph's behavior will produce different output without a clear signal.
```suggestion
extraction_mode: str = "auto", # TODO(deprecation): warn callers who relied on the old "pdf" default
```
**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))
How can I resolve this? If you propose a fix, please make it concise.| monkeypatch.setattr( | ||
| "nemo_retriever.graph.multi_type_extract_operator._MultiTypeExtractBase._run_detection_pipeline", | ||
| lambda self, batch_df: batch_df, | ||
| ) |
There was a problem hiding this comment.
Mocking too deep in the stack — fragile tests ahead. Both new tests patch
_MultiTypeExtractBase._run_detection_pipeline and _run_pdf_pipeline, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.
Rule Used: Tests must mock all external service dependencies ... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_ingest_interface.py
Line: 92-95
Comment:
Mocking too deep in the stack — fragile tests ahead. Both new tests patch `_MultiTypeExtractBase._run_detection_pipeline` and `_run_pdf_pipeline`, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.
**Rule Used:** Tests must mock all external service dependencies ... ([source](https://app.greptile.com/review/custom-context?memory=mock-external-services))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Summary
GraphIngestor.extract()toextraction_mode="auto"so supported non-PDF inputs are routed throughMultiTypeExtractOperatorby extension.extraction_mode="pdf"as the explicit dedicated PDF/document graph path.graph_ingestor.mddocumenting GraphIngestor usage, run modes, extraction modes, supported auto-dispatch groups, split configuration, and post-extraction stages.Testing
python -m pytest nemo_retriever/tests/test_ingest_interface.py -q