[bug-fix] Normalising missing input path errors in graph ingest across modes by mahikaw · Pull Request #2007 · NVIDIA/NeMo-Retriever

mahikaw · 2026-05-08T20:21:28Z

Description

Validated graph ingest input paths before file loading in both batch and inprocess modes.
Raised a consistent NeMo-level FileNotFoundError for missing literal paths instead of leaking Ray/PyArrow errors (batch mode) or silently returning an empty DataFrame (in-process mode)

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-05-08T20:25:21Z

Greptile Summary

This PR normalises input-path error handling for graph ingest by extracting shared helpers (expand_input_file_patterns, raise_input_path_not_found, _is_explicit_glob_path) into input_files.py and wiring them into both InprocessExecutor and RayDataExecutor.

input_files.py: new helpers pre-validate literal paths before handing them to file readers, and glob expansion is now filtered to files-only, so matched directories no longer silently enter the read pipeline.
executor.py: InprocessExecutor._load_files raises a normalised FileNotFoundError for unresolved non-glob paths; RayDataExecutor.ingest catches FileNotFoundError from rd.read_binary_files and re-raises it with a consistent product-level message.
test_pipeline_graph.py: seven new test cases cover missing-literal-path rejection, unmatched-glob passthrough, directory expansion, and Ray reader error normalisation for both executors.

Confidence Score: 4/5

Safe to merge for local-filesystem workflows; any caller passing remote URIs (s3://, gs://) to RayDataExecutor will now get a premature FileNotFoundError instead of a successful Ray read.

The local-path validation in expand_input_file_patterns uses Path(...).exists() without first checking for a URI scheme. Remote paths that glob.glob cannot expand locally will fail the existence check and be rejected before Ray ever sees them, reversing behaviour that worked correctly in the old code. All other changed logic is correct and well-tested.

nemo_retriever/src/nemo_retriever/utils/input_files.py — expand_input_file_patterns needs a URI-scheme guard before the Path.exists() check

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/utils/input_files.py	Adds expand_input_file_patterns and raise_input_path_not_found helpers; local-only path validation will reject remote URIs (s3://, gs://) that were previously passed through to Ray
nemo_retriever/src/nemo_retriever/graph/executor.py	Centralises glob expansion into expand_input_file_patterns and catches FileNotFoundError from rd.read_binary_files; wires up new helpers cleanly
nemo_retriever/tests/test_pipeline_graph.py	Adds 7 new test cases covering missing-path rejection, unmatched globs, and directory expansion for both executors; tests are well-scoped and isolated

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ingest(data)"] --> B{data type?}
    B -->|"pd.DataFrame / rd.Dataset"| C[Use directly]
    B -->|"str / list[str]"| D["expand_input_file_patterns(data)"]
    D --> E{For each path}
    E --> F["glob.glob(pattern) files only"]
    F -->|matches| G[Extend expanded with sorted matches]
    F -->|no matches + is_glob| H[Append raw pattern to expanded]
    F -->|no matches + not glob + not exists| I["raise_input_path_not_found() FileNotFoundError"]
    F -->|no matches + not glob + exists| J[Append path - directory passthrough]
    G & H & J --> K[return expanded list]
    K --> L{Executor?}
    L -->|InprocessExecutor| M["_load_files(expanded)"]
    M --> N{For each path p}
    N -->|is_file| O[Read bytes to DataFrame row]
    N -->|not file + not glob| P["raise_input_path_not_found(p)"]
    N -->|not file + is_glob| Q[Skip silently]
    L -->|RayDataExecutor| R["rd.read_binary_files(expanded)"]
    R -->|success| S[Ray Dataset]
    R -->|FileNotFoundError| T["raise_input_path_not_found(input_paths, exc)"]

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/utils/input_files.py:57-66
**Remote URI paths rejected as missing local files**

`expand_input_file_patterns` performs local-filesystem validation on every non-glob path via `Path(pattern).exists()`. Any remote URI that `RayDataExecutor` previously accepted — e.g. `s3://my-bucket/docs/file.pdf` or `gs://...` — has no glob magic, returns `[]` from `glob.glob`, and fails `Path(...).exists()`, so `raise_input_path_not_found` fires before Ray is ever reached. In the old code the same path fell through `glob.glob` returning `[]` and was passed unchanged to `rd.read_binary_files`. This is a regression: callers who use Ray's remote-filesystem support will now get `FileNotFoundError("Input path does not exist: s3://...")` instead of a working read.

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/utils/input_files.py:63-64
**Unmatched local glob passed through to Ray may produce a confusing error**

When a glob like `/tmp/nothing/*.pdf` matches no files locally, the raw pattern string is appended to `expanded` and handed to `rd.read_binary_files`. Ray (via PyArrow) will attempt to expand it; if it yields zero matches it raises `FileNotFoundError`, which gets caught and re-raised as `"Input path does not exist: ['/tmp/nothing/*.pdf']"` — misleading because the pattern is valid, it simply matched nothing. `InprocessExecutor` already silently returns an empty DataFrame for the same input (`_load_files` skips entries where `_is_explicit_glob_path` is `True`). Adding a test and aligning `RayDataExecutor`'s zero-match-glob behaviour to return an empty dataset (or at least produce a clearer message) would remove the inconsistency.

_{Reviews (5): Last reviewed commit: "added dir expansion tests for direct gra..." | Re-trigger Greptile}

jperez999 · 2026-05-08T21:19:33Z

+
+        local_path = Path(raw_path).expanduser()
+        if local_path.is_dir():
+            raise IsADirectoryError(f"Input path is a directory, not a file or glob pattern: {local_path}")


So i think we want to allow directory level input paths and we assume everything in that path is part of the input data.

jperez999 · 2026-05-08T21:25:08Z

+InputPath = str | PathLike[str]
+
+
+def _has_uri_scheme(path: str) -> bool:


What pattern were you expecting to leverage this urlparse?

…ess executors

greptile-apps · 2026-05-08T23:10:47Z

+    for input_path in paths:
+        raw_path = fspath(input_path)
+        pattern = str(Path(raw_path).expanduser())
+        matches = [match for match in glob.glob(pattern, recursive=True) if Path(match).is_file()]
+        if matches:
+            expanded.extend(sorted(matches))
+        elif _is_explicit_glob_path(pattern):
+            expanded.append(pattern)
+        elif not Path(pattern).exists():
+            raise_input_path_not_found(pattern)


Remote URI paths rejected as missing local files

expand_input_file_patterns performs local-filesystem validation on every non-glob path via Path(pattern).exists(). Any remote URI that RayDataExecutor previously accepted — e.g. s3://my-bucket/docs/file.pdf or gs://... — has no glob magic, returns [] from glob.glob, and fails Path(...).exists(), so raise_input_path_not_found fires before Ray is ever reached. In the old code the same path fell through glob.glob returning [] and was passed unchanged to rd.read_binary_files. This is a regression: callers who use Ray's remote-filesystem support will now get FileNotFoundError("Input path does not exist: s3://...") instead of a working read.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/utils/input_files.py Line: 57-66 Comment: **Remote URI paths rejected as missing local files** `expand_input_file_patterns` performs local-filesystem validation on every non-glob path via `Path(pattern).exists()`. Any remote URI that `RayDataExecutor` previously accepted — e.g. `s3://my-bucket/docs/file.pdf` or `gs://...` — has no glob magic, returns `[]` from `glob.glob`, and fails `Path(...).exists()`, so `raise_input_path_not_found` fires before Ray is ever reached. In the old code the same path fell through `glob.glob` returning `[]` and was passed unchanged to `rd.read_binary_files`. This is a regression: callers who use Ray's remote-filesystem support will now get `FileNotFoundError("Input path does not exist: s3://...")` instead of a working read. How can I resolve this? If you propose a fix, please make it concise.

mahikaw requested review from a team as code owners May 8, 2026 20:21

mahikaw requested a review from ChrisJar May 8, 2026 20:21

mahikaw changed the title ~~[Bug-fix] Normalising missing input path errors in graph ingest across modes~~ [bug-fix] Normalising missing input path errors in graph ingest across modes May 8, 2026

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/utils/input_files.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/utils/input_files.py Outdated

mahikaw added 2 commits May 8, 2026 13:55

normalising missing input path errors in graph ingest across modes

89d3316

added isDir error handling and other greptile review comments

8a6e5cd

mahikaw force-pushed the dev/mahikaw/bug_fix_missing_input_path_handling branch from 8851dac to 8a6e5cd Compare May 8, 2026 20:55

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/graph/executor.py Outdated

added explicit is file check

bd02d60

jperez999 requested changes May 8, 2026

View reviewed changes

mahikaw added 2 commits May 8, 2026 15:52

addresses review comments

2d1a546

added dir expansion tests for direct graph ingestor using ray/in-proc…

daeddb4

…ess executors

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug-fix] Normalising missing input path errors in graph ingest across modes#2007

[bug-fix] Normalising missing input path errors in graph ingest across modes#2007
mahikaw wants to merge 5 commits intomainfrom
dev/mahikaw/bug_fix_missing_input_path_handling

mahikaw commented May 8, 2026

Uh oh!

greptile-apps Bot commented May 8, 2026 •

edited

Loading

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jperez999 May 8, 2026

Uh oh!

jperez999 May 8, 2026

Uh oh!

Uh oh!

greptile-apps Bot May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		InputPath = str \| PathLike[str]


		def _has_uri_scheme(path: str) -> bool:

Conversation

mahikaw commented May 8, 2026

Description

Checklist

Uh oh!

greptile-apps Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jperez999 May 8, 2026

Choose a reason for hiding this comment

Uh oh!

jperez999 May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 8, 2026 •

edited

Loading