[bug-fix] Normalising missing input path errors in graph ingest across modes#2007
[bug-fix] Normalising missing input path errors in graph ingest across modes#2007
Conversation
Greptile SummaryThis PR normalises input-path error handling for graph ingest by extracting shared helpers (
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/utils/input_files.py | Adds expand_input_file_patterns and raise_input_path_not_found helpers; local-only path validation will reject remote URIs (s3://, gs://) that were previously passed through to Ray |
| nemo_retriever/src/nemo_retriever/graph/executor.py | Centralises glob expansion into expand_input_file_patterns and catches FileNotFoundError from rd.read_binary_files; wires up new helpers cleanly |
| nemo_retriever/tests/test_pipeline_graph.py | Adds 7 new test cases covering missing-path rejection, unmatched globs, and directory expansion for both executors; tests are well-scoped and isolated |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["ingest(data)"] --> B{data type?}
B -->|"pd.DataFrame / rd.Dataset"| C[Use directly]
B -->|"str / list[str]"| D["expand_input_file_patterns(data)"]
D --> E{For each path}
E --> F["glob.glob(pattern) files only"]
F -->|matches| G[Extend expanded with sorted matches]
F -->|no matches + is_glob| H[Append raw pattern to expanded]
F -->|no matches + not glob + not exists| I["raise_input_path_not_found() FileNotFoundError"]
F -->|no matches + not glob + exists| J[Append path - directory passthrough]
G & H & J --> K[return expanded list]
K --> L{Executor?}
L -->|InprocessExecutor| M["_load_files(expanded)"]
M --> N{For each path p}
N -->|is_file| O[Read bytes to DataFrame row]
N -->|not file + not glob| P["raise_input_path_not_found(p)"]
N -->|not file + is_glob| Q[Skip silently]
L -->|RayDataExecutor| R["rd.read_binary_files(expanded)"]
R -->|success| S[Ray Dataset]
R -->|FileNotFoundError| T["raise_input_path_not_found(input_paths, exc)"]
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/utils/input_files.py:57-66
**Remote URI paths rejected as missing local files**
`expand_input_file_patterns` performs local-filesystem validation on every non-glob path via `Path(pattern).exists()`. Any remote URI that `RayDataExecutor` previously accepted — e.g. `s3://my-bucket/docs/file.pdf` or `gs://...` — has no glob magic, returns `[]` from `glob.glob`, and fails `Path(...).exists()`, so `raise_input_path_not_found` fires before Ray is ever reached. In the old code the same path fell through `glob.glob` returning `[]` and was passed unchanged to `rd.read_binary_files`. This is a regression: callers who use Ray's remote-filesystem support will now get `FileNotFoundError("Input path does not exist: s3://...")` instead of a working read.
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/utils/input_files.py:63-64
**Unmatched local glob passed through to Ray may produce a confusing error**
When a glob like `/tmp/nothing/*.pdf` matches no files locally, the raw pattern string is appended to `expanded` and handed to `rd.read_binary_files`. Ray (via PyArrow) will attempt to expand it; if it yields zero matches it raises `FileNotFoundError`, which gets caught and re-raised as `"Input path does not exist: ['/tmp/nothing/*.pdf']"` — misleading because the pattern is valid, it simply matched nothing. `InprocessExecutor` already silently returns an empty DataFrame for the same input (`_load_files` skips entries where `_is_explicit_glob_path` is `True`). Adding a test and aligning `RayDataExecutor`'s zero-match-glob behaviour to return an empty dataset (or at least produce a clearer message) would remove the inconsistency.
Reviews (5): Last reviewed commit: "added dir expansion tests for direct gra..." | Re-trigger Greptile
8851dac to
8a6e5cd
Compare
|
|
||
| local_path = Path(raw_path).expanduser() | ||
| if local_path.is_dir(): | ||
| raise IsADirectoryError(f"Input path is a directory, not a file or glob pattern: {local_path}") |
There was a problem hiding this comment.
So i think we want to allow directory level input paths and we assume everything in that path is part of the input data.
| InputPath = str | PathLike[str] | ||
|
|
||
|
|
||
| def _has_uri_scheme(path: str) -> bool: |
There was a problem hiding this comment.
What pattern were you expecting to leverage this urlparse?
| for input_path in paths: | ||
| raw_path = fspath(input_path) | ||
| pattern = str(Path(raw_path).expanduser()) | ||
| matches = [match for match in glob.glob(pattern, recursive=True) if Path(match).is_file()] | ||
| if matches: | ||
| expanded.extend(sorted(matches)) | ||
| elif _is_explicit_glob_path(pattern): | ||
| expanded.append(pattern) | ||
| elif not Path(pattern).exists(): | ||
| raise_input_path_not_found(pattern) |
There was a problem hiding this comment.
Remote URI paths rejected as missing local files
expand_input_file_patterns performs local-filesystem validation on every non-glob path via Path(pattern).exists(). Any remote URI that RayDataExecutor previously accepted — e.g. s3://my-bucket/docs/file.pdf or gs://... — has no glob magic, returns [] from glob.glob, and fails Path(...).exists(), so raise_input_path_not_found fires before Ray is ever reached. In the old code the same path fell through glob.glob returning [] and was passed unchanged to rd.read_binary_files. This is a regression: callers who use Ray's remote-filesystem support will now get FileNotFoundError("Input path does not exist: s3://...") instead of a working read.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/utils/input_files.py
Line: 57-66
Comment:
**Remote URI paths rejected as missing local files**
`expand_input_file_patterns` performs local-filesystem validation on every non-glob path via `Path(pattern).exists()`. Any remote URI that `RayDataExecutor` previously accepted — e.g. `s3://my-bucket/docs/file.pdf` or `gs://...` — has no glob magic, returns `[]` from `glob.glob`, and fails `Path(...).exists()`, so `raise_input_path_not_found` fires before Ray is ever reached. In the old code the same path fell through `glob.glob` returning `[]` and was passed unchanged to `rd.read_binary_files`. This is a regression: callers who use Ray's remote-filesystem support will now get `FileNotFoundError("Input path does not exist: s3://...")` instead of a working read.
How can I resolve this? If you propose a fix, please make it concise.
Description
Addresses 6154864
Checklist