Refactor dataset loading, use of named tuple, and updating the available datasets everywhere #2001

sahel-sh wants to merge 7 commits into NVIDIA:main from cleanup_dataset_utils
Conversation
Signed-off-by: Sahel Sharifymoghaddam <ssharifymogh@nvidia.com>
Greptile Summary

This PR renames `load_vidore_dataset` to `load_benchmark_dataset`, adds a `BenchmarkDatasetBundle` NamedTuple for the loader's return value, and deduplicates the CLI's dataset-filtering logic.
| Filename | Overview |
|---|---|
| retrieval-bench/src/retrieval_bench/pipeline_evaluation/dataset_loader.py | Adds `BenchmarkDatasetBundle` NamedTuple; renames `load_vidore_dataset` to `load_benchmark_dataset`; fixes `Optional[str]` type annotation. Logic unchanged. |
| retrieval-bench/src/retrieval_bench/__init__.py | Updates top-level `__all__` export from `load_vidore_dataset` to `load_benchmark_dataset`; no deprecation alias provided, so any external caller using the old name breaks. |
| retrieval-bench/src/retrieval_bench/cli/pipeline_evaluation.py | Extracts `_resolve_selected_datasets` helper to eliminate 3x duplicated dataset-filter logic; updates help text for BRIGHT; no logic changes. |
| retrieval-bench/src/retrieval_bench/cli/evaluate.py | Switches from tuple unpacking to NamedTuple attribute access after the `load_benchmark_dataset` call; clean mechanical update. |
| retrieval-bench/src/retrieval_bench/pipeline_evaluation/__init__.py | Updates internal `__all__` to export `load_benchmark_dataset`; same breaking-rename concern applies at this sub-package level. |
| retrieval-bench/src/retrieval_bench/pipelines/__init__.py | Docstring update only; changes 'vidore-benchmark evaluation' to 'benchmark evaluation'. |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["load_benchmark_dataset(dataset_name, split, language)"] --> B{starts with 'bright/'?}
    B -- Yes --> C["_load_bright_split(task, split, language)"]
    B -- No --> D["_upstream_load_vidore_dataset(...)"]
    C --> E["BenchmarkDatasetBundle(...)"]
    D --> F["add excluded_ids_by_query placeholder"]
    F --> E
    E --> G["Callers: evaluate.py via dataset.query_ids, dataset.queries, ..."]
    H["_resolve_selected_datasets(all_datasets, datasets)"] --> I{all_datasets or no filter?}
    I -- Yes --> J["return all ds_full"]
    I -- No --> K["tokenize --datasets CSV"]
    K --> L{tok in short_by_full or full_by_short?}
    L -- Yes --> M["append canonical name"]
    L -- No --> N["raise typer.BadParameter"]
    M --> O["return selected, short_by_full"]
    J --> O
```
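Based on the flowchart above, the `_resolve_selected_datasets` helper can be sketched roughly as follows. The exact signature, the shape of `ds_full` and `short_by_full`, and the use of `ValueError` in place of `typer.BadParameter` (to keep the sketch dependency-free) are assumptions for illustration:

```python
from typing import Dict, List, Optional, Tuple


def resolve_selected_datasets(
    all_datasets: bool,
    datasets: Optional[str],
    ds_full: List[str],
    short_by_full: Dict[str, str],
) -> Tuple[List[str], Dict[str, str]]:
    """Map a --datasets CSV of short or full names to canonical full names."""
    if all_datasets or not datasets:
        # no filter requested: return every known dataset
        return list(ds_full), short_by_full

    # invert the full -> short map so short aliases can be resolved too
    full_by_short = {short: full for full, short in short_by_full.items()}
    selected: List[str] = []
    for tok in (t.strip() for t in datasets.split(",")):
        if tok in short_by_full:
            # token is already a canonical full name
            selected.append(tok)
        elif tok in full_by_short:
            # token is a short alias; map it to the canonical name
            selected.append(full_by_short[tok])
        else:
            # the real CLI raises typer.BadParameter here
            raise ValueError(f"Unknown dataset: {tok!r}")
    return selected, short_by_full
```

This mirrors the two branches in the flowchart: the all-datasets fast path and the per-token canonicalization with an error on unknown names.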
```python
class BenchmarkDatasetBundle(NamedTuple):
    query_ids: List[str]
    queries: List[str]
    corpus_ids: List[str]
    corpus_images: List[Any]
    corpus_texts: List[str]
    qrels: Dict[str, Dict[str, int]]
    query_languages: Dict[str, str]
    excluded_ids_by_query: Dict[str, List[str]]
```
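Since `BenchmarkDatasetBundle` is a `NamedTuple`, callers such as `evaluate.py` can switch to attribute access while positional indexing keeps working; a small self-contained sketch (the field values are invented for illustration):

```python
from typing import Any, Dict, List, NamedTuple


class BenchmarkDatasetBundle(NamedTuple):
    query_ids: List[str]
    queries: List[str]
    corpus_ids: List[str]
    corpus_images: List[Any]
    corpus_texts: List[str]
    qrels: Dict[str, Dict[str, int]]
    query_languages: Dict[str, str]
    excluded_ids_by_query: Dict[str, List[str]]


bundle = BenchmarkDatasetBundle(
    query_ids=["q1"],
    queries=["what is X?"],
    corpus_ids=["d1"],
    corpus_images=[None],
    corpus_texts=["X is ..."],
    qrels={"q1": {"d1": 1}},
    query_languages={"q1": "en"},
    excluded_ids_by_query={},
)

# attribute access, as used by the updated evaluate.py
assert bundle.queries[0] == "what is X?"
# it is still a tuple, so pre-rename positional code keeps working
assert bundle[0] == bundle.query_ids
```

Keeping tuple compatibility is what makes the `evaluate.py` change a "clean mechanical update": existing unpacking sites do not break while they are migrated.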
**Missing tests for new public API** (`retrieval-bench/src/retrieval_bench/pipeline_evaluation/dataset_loader.py:39-47`)

The PR checklist acknowledges that tests are not added. `BenchmarkDatasetBundle` and the renamed `load_benchmark_dataset` (including its BRIGHT routing path) introduce new logic; the `_load_bright_split` data-validation path in particular has several error branches (`No queries found`, `No relevance judgments`, `excluded_ids` overlaps `gold_ids`) that are untested. Per the `test-coverage-new-code` rule, unit tests covering the happy path and these error branches should accompany this change.

**Rule Used:** New functionality must include corresponding unit ...
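A minimal sketch of the kind of error-branch unit tests the comment asks for. The stub below is a hypothetical stand-in mirroring only the validation branches named in the review; the real `_load_bright_split` has a different signature and loads actual data:

```python
import unittest
from typing import Dict, List, Optional


def load_bright_split_stub(
    queries: List[str],
    qrels: Dict[str, Dict[str, int]],
    excluded_ids_by_query: Optional[Dict[str, List[str]]] = None,
):
    """Hypothetical stand-in exposing the error branches the review mentions."""
    if not queries:
        raise ValueError("No queries found")
    if not qrels:
        raise ValueError("No relevance judgments")
    excluded = excluded_ids_by_query or {}
    for qid, gold in qrels.items():
        overlap = set(excluded.get(qid, [])) & set(gold)
        if overlap:
            raise ValueError(f"excluded_ids overlaps gold_ids for {qid}: {sorted(overlap)}")
    return queries, qrels, excluded


class TestBrightErrorBranches(unittest.TestCase):
    def test_no_queries(self):
        with self.assertRaisesRegex(ValueError, "No queries found"):
            load_bright_split_stub([], {"q1": {"d1": 1}})

    def test_no_qrels(self):
        with self.assertRaisesRegex(ValueError, "No relevance judgments"):
            load_bright_split_stub(["q1"], {})

    def test_excluded_overlaps_gold(self):
        with self.assertRaisesRegex(ValueError, "overlaps gold_ids"):
            load_bright_split_stub(["q1"], {"q1": {"d1": 1}}, {"q1": ["d1"]})

    def test_happy_path(self):
        queries, qrels, _ = load_bright_split_stub(["q1"], {"q1": {"d1": 1}})
        self.assertEqual(queries, ["q1"])
        self.assertEqual(qrels, {"q1": {"d1": 1}})
```

The same four cases, pointed at the real function instead of the stub, would satisfy the rule without requiring network access if the loader's data source is monkeypatched.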
This is not new logic, just a rename from `load_vidore_dataset` to `load_benchmark_dataset`; `_load_bright_split` is not a new function.
The `_load_bright_split` call inside this function is not new code; the change only renames `load_vidore_dataset` to `load_benchmark_dataset`.
```diff
 __all__ = [
     "BasePipeline",
     "evaluate_retrieval",
     "aggregate_results",
-    "load_vidore_dataset",
+    "load_benchmark_dataset",
     "get_available_datasets",
     "print_dataset_info",
 ]
```
**Breaking public API rename without deprecation alias** (`retrieval-bench/src/retrieval_bench/__init__.py:22-29`)

`load_vidore_dataset` was the exported public symbol in `__all__` and is now fully removed with no backward-compatible alias. Any downstream code (notebooks, scripts, other packages) doing `from retrieval_bench import load_vidore_dataset` will get an `ImportError` immediately on upgrade. The codebase's backward-compatibility rule requires a deprecation cycle for renames rather than a hard removal. Adding `load_vidore_dataset = load_benchmark_dataset` alongside a deprecation warning in `dataset_loader.py` would preserve compatibility without blocking the rename.
@oliverholworthy should we have a proper deprecation cycle for this name? I expect most of the usage to be through the CLI commands rather than direct imports.
Description
This PR renames the dataset-loading function so that it covers BRIGHT in addition to ViDoRe, and adds a named tuple for the returned loaded objects. It also factors the repeated dataset-filtering code in the CLI out into a helper function and updates the outdated example for `list-dataset`.
Checklist