Skip to content

[bug-fix] Normalising missing input path errors in graph ingest across modes#2007

Open
mahikaw wants to merge 5 commits intomainfrom
dev/mahikaw/bug_fix_missing_input_path_handling
Open

[bug-fix] Normalising missing input path errors in graph ingest across modes#2007
mahikaw wants to merge 5 commits intomainfrom
dev/mahikaw/bug_fix_missing_input_path_handling

Conversation

@mahikaw
Copy link
Copy Markdown
Collaborator

@mahikaw mahikaw commented May 8, 2026

Description

  • Validated graph ingest input paths before file loading in both batch and inprocess modes.
  • Raised a consistent NeMo-level FileNotFoundError for missing literal paths instead of leaking Ray/PyArrow errors (batch mode) or silently returning an empty DataFrame (in-process mode)

Addresses 6154864

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@mahikaw mahikaw requested review from a team as code owners May 8, 2026 20:21
@mahikaw mahikaw requested a review from ChrisJar May 8, 2026 20:21
@mahikaw mahikaw changed the title [Bug-fix] Normalising missing input path errors in graph ingest across modes [bug-fix] Normalising missing input path errors in graph ingest across modes May 8, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR normalises input-path error handling for graph ingest by extracting shared helpers (expand_input_file_patterns, raise_input_path_not_found, _is_explicit_glob_path) into input_files.py and wiring them into both InprocessExecutor and RayDataExecutor.

  • input_files.py: new helpers pre-validate literal paths before handing them to file readers, and glob expansion is now filtered to files-only, so matched directories no longer silently enter the read pipeline.
  • executor.py: InprocessExecutor._load_files raises a normalised FileNotFoundError for unresolved non-glob paths; RayDataExecutor.ingest catches FileNotFoundError from rd.read_binary_files and re-raises it with a consistent product-level message.
  • test_pipeline_graph.py: seven new test cases cover missing-literal-path rejection, unmatched-glob passthrough, directory expansion, and Ray reader error normalisation for both executors.

Confidence Score: 4/5

Safe to merge for local-filesystem workflows; any caller passing remote URIs (s3://, gs://) to RayDataExecutor will now get a premature FileNotFoundError instead of a successful Ray read.

The local-path validation in expand_input_file_patterns uses Path(...).exists() without first checking for a URI scheme. Remote paths that glob.glob cannot expand locally will fail the existence check and be rejected before Ray ever sees them, reversing behaviour that worked correctly in the old code. All other changed logic is correct and well-tested.

nemo_retriever/src/nemo_retriever/utils/input_files.py — expand_input_file_patterns needs a URI-scheme guard before the Path.exists() check

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/utils/input_files.py Adds expand_input_file_patterns and raise_input_path_not_found helpers; local-only path validation will reject remote URIs (s3://, gs://) that were previously passed through to Ray
nemo_retriever/src/nemo_retriever/graph/executor.py Centralises glob expansion into expand_input_file_patterns and catches FileNotFoundError from rd.read_binary_files; wires up new helpers cleanly
nemo_retriever/tests/test_pipeline_graph.py Adds 7 new test cases covering missing-path rejection, unmatched globs, and directory expansion for both executors; tests are well-scoped and isolated

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ingest(data)"] --> B{data type?}
    B -->|"pd.DataFrame / rd.Dataset"| C[Use directly]
    B -->|"str / list[str]"| D["expand_input_file_patterns(data)"]
    D --> E{For each path}
    E --> F["glob.glob(pattern) files only"]
    F -->|matches| G[Extend expanded with sorted matches]
    F -->|no matches + is_glob| H[Append raw pattern to expanded]
    F -->|no matches + not glob + not exists| I["raise_input_path_not_found() FileNotFoundError"]
    F -->|no matches + not glob + exists| J[Append path - directory passthrough]
    G & H & J --> K[return expanded list]
    K --> L{Executor?}
    L -->|InprocessExecutor| M["_load_files(expanded)"]
    M --> N{For each path p}
    N -->|is_file| O[Read bytes to DataFrame row]
    N -->|not file + not glob| P["raise_input_path_not_found(p)"]
    N -->|not file + is_glob| Q[Skip silently]
    L -->|RayDataExecutor| R["rd.read_binary_files(expanded)"]
    R -->|success| S[Ray Dataset]
    R -->|FileNotFoundError| T["raise_input_path_not_found(input_paths, exc)"]
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/utils/input_files.py:57-66
**Remote URI paths rejected as missing local files**

`expand_input_file_patterns` performs local-filesystem validation on every non-glob path via `Path(pattern).exists()`. Any remote URI that `RayDataExecutor` previously accepted — e.g. `s3://my-bucket/docs/file.pdf` or `gs://...` — has no glob magic, returns `[]` from `glob.glob`, and fails `Path(...).exists()`, so `raise_input_path_not_found` fires before Ray is ever reached. In the old code the same path fell through `glob.glob` returning `[]` and was passed unchanged to `rd.read_binary_files`. This is a regression: callers who use Ray's remote-filesystem support will now get `FileNotFoundError("Input path does not exist: s3://...")` instead of a working read.

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/utils/input_files.py:63-64
**Unmatched local glob passed through to Ray may produce a confusing error**

When a glob like `/tmp/nothing/*.pdf` matches no files locally, the raw pattern string is appended to `expanded` and handed to `rd.read_binary_files`. Ray (via PyArrow) will attempt to expand it; if it yields zero matches it raises `FileNotFoundError`, which gets caught and re-raised as `"Input path does not exist: ['/tmp/nothing/*.pdf']"` — misleading because the pattern is valid, it simply matched nothing. `InprocessExecutor` already silently returns an empty DataFrame for the same input (`_load_files` skips entries where `_is_explicit_glob_path` is `True`). Adding a test and aligning `RayDataExecutor`'s zero-match-glob behaviour to return an empty dataset (or at least produce a clearer message) would remove the inconsistency.

Reviews (5): Last reviewed commit: "added dir expansion tests for direct gra..." | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/utils/input_files.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/utils/input_files.py Outdated
@mahikaw mahikaw force-pushed the dev/mahikaw/bug_fix_missing_input_path_handling branch from 8851dac to 8a6e5cd Compare May 8, 2026 20:55
Comment thread nemo_retriever/src/nemo_retriever/graph/executor.py Outdated

local_path = Path(raw_path).expanduser()
if local_path.is_dir():
raise IsADirectoryError(f"Input path is a directory, not a file or glob pattern: {local_path}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So i think we want to allow directory level input paths and we assume everything in that path is part of the input data.

InputPath = str | PathLike[str]


def _has_uri_scheme(path: str) -> bool:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What pattern were you expecting to leverage this urlparse?

Comment thread nemo_retriever/src/nemo_retriever/utils/input_files.py Outdated
Comment on lines +57 to +66
for input_path in paths:
raw_path = fspath(input_path)
pattern = str(Path(raw_path).expanduser())
matches = [match for match in glob.glob(pattern, recursive=True) if Path(match).is_file()]
if matches:
expanded.extend(sorted(matches))
elif _is_explicit_glob_path(pattern):
expanded.append(pattern)
elif not Path(pattern).exists():
raise_input_path_not_found(pattern)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Remote URI paths rejected as missing local files

expand_input_file_patterns performs local-filesystem validation on every non-glob path via Path(pattern).exists(). Any remote URI that RayDataExecutor previously accepted — e.g. s3://my-bucket/docs/file.pdf or gs://... — has no glob magic, returns [] from glob.glob, and fails Path(...).exists(), so raise_input_path_not_found fires before Ray is ever reached. In the old code the same path fell through glob.glob returning [] and was passed unchanged to rd.read_binary_files. This is a regression: callers who use Ray's remote-filesystem support will now get FileNotFoundError("Input path does not exist: s3://...") instead of a working read.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/utils/input_files.py
Line: 57-66

Comment:
**Remote URI paths rejected as missing local files**

`expand_input_file_patterns` performs local-filesystem validation on every non-glob path via `Path(pattern).exists()`. Any remote URI that `RayDataExecutor` previously accepted — e.g. `s3://my-bucket/docs/file.pdf` or `gs://...` — has no glob magic, returns `[]` from `glob.glob`, and fails `Path(...).exists()`, so `raise_input_path_not_found` fires before Ray is ever reached. In the old code the same path fell through `glob.glob` returning `[]` and was passed unchanged to `rd.read_binary_files`. This is a regression: callers who use Ray's remote-filesystem support will now get `FileNotFoundError("Input path does not exist: s3://...")` instead of a working read.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants