
[REVIEW] Benchmarking Script for E2E Pipeline #1389

Merged
VibhuJawa merged 19 commits into NVIDIA-NeMo:main from VibhuJawa:vjawa/add_benchmarking_scripts
Feb 4, 2026

Conversation

@VibhuJawa
Contributor

This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.

Configuration for local benchmarking:

  • Added benchmarking/local-cc-e2e.yaml to define local paths for results, datasets, and models, as well as global settings like default timeout and scratch deletion.
  • Configured a single pipeline entry (cc_e2e_pipeline_local) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jan 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment thread benchmarking/local-cc-e2e.yaml Outdated
Comment on lines +18 to +35
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
Contributor


  • Instead of a new file, can you add it to nightly and then run with --entries cc_e2e_pipeline_local?
  • We should probably have more CPUs to test benchmarking.
  • For url_limit = 1, we're only downloading one file, and if the download takes 10 minutes the other N-1 CPUs would just sit idle. So we should probably set url_limit = K * num_cpus to test parallelism and load balancing.
  • For record_limit, wdyt of removing it? We know there is a backpressure issue in Ray Data when the stages are actors, so we can test for that later. IIRC when record_limit is higher, iterate is slower than download. I'm hoping that when we combine iterate and extract in the future, we'll see perf benefits.

For fasttext-model-path and hf-home, we could probably use something under the datasets path itself? wdyt?
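Not a real entry from the repo, but the bullets above might translate into a config sketch like this (entry name and values are illustrative, assuming a hypothetical K=4 files per CPU; record_limit dropped per the backpressure concern):

```yaml
- name: cc_e2e_pipeline_nightly
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --url_limit=64          # K * num_cpus (4 * 16) so downloads keep every CPU busy
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 16            # more CPUs to actually exercise parallelism
```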

Contributor Author


Addressed the review:

  • Added something like arxiv_e2e_pipeline_*
  • Moved to a locally downloaded tar dump (to simulate the workflow better)
  • We now run on 45 downloaded tar files; we can increase this in a follow-up PR for larger-scale testing as needed
  • Removed record_limit
  • Added paths, let me know if that works.

Comment thread benchmarking/scripts/cc_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/cc_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/local-cc-e2e.yaml Outdated
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
Contributor


Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (exact_value), and then probably something for throughput (min_value)?
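A hypothetical sketch of how such checks could look in a benchmark entry (the exact_value/min_value key names follow the comment above; the metric names and numbers are made up):

```yaml
metrics:
  - name: num_input_documents
    exact_value: 100000        # row count entering the pipeline
  - name: num_output_documents
    exact_value: 73250         # row count surviving the filters
  - name: throughput_docs_per_s
    min_value: 500             # flag a regression if throughput drops below this
```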

Contributor Author

@VibhuJawa VibhuJawa Jan 22, 2026


Added more metrics like num_tar_files, num_input_documents, and num_output_documents. I have not added throughput as a requirement for now. Please take a look (I have not run this on the benchmarking machine).

Comment thread benchmarking/scripts/cc_e2e_pipeline_benchmark.py Outdated
VibhuJawa and others added 4 commits January 21, 2026 14:37
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.

Changes:

  • Added get_aggregated_stage_stats utility function to benchmarking/scripts/utils.py for extracting aggregated performance metrics from pipeline results
  • Created arxiv_e2e_pipeline_benchmark.py with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
  • Added two new dataset configurations (arxiv_downloads and fasttext_model) and two benchmark entries (arxiv_e2e_pipeline_raydata and arxiv_e2e_pipeline_xenna) to nightly-benchmark.yaml for testing both Ray Data and Xenna executors
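As a rough sketch of what a helper like get_aggregated_stage_stats might do (the real signature and result layout are defined in benchmarking/scripts/utils.py; the per-result dict shape below is an assumption):

```python
def get_aggregated_stage_stats(results, stage_prefix, stat_key):
    """Sum `stat_key` across all stages whose name starts with `stage_prefix`.

    Assumes each result maps stage name -> stats dict, e.g.
    {"extract_arxivextractor": {"num_items_processed": 42}}.
    """
    total = 0
    for result in results or []:
        for stage_name, stats in result.items():
            if stage_name.startswith(stage_prefix):
                total += stats.get(stat_key, 0)
    return total


# Example aggregation over two per-file results.
results = [
    {"extract_arxiv": {"num_items_processed": 2}},
    {"extract_arxiv": {"num_items_processed": 3}, "filter_len": {"num_items_processed": 9}},
]
print(get_aggregated_stage_stats(results, "extract_", "num_items_processed"))  # 5
```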

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
benchmarking/scripts/utils.py Adds helper function for aggregating stage performance statistics by matching stage name prefixes
benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Complete E2E benchmark script with custom stages for local tar file processing, comprehensive filtering pipeline, and configurable execution modes
benchmarking/nightly-benchmark.yaml Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa requested a review from Copilot January 22, 2026 21:47
@VibhuJawa VibhuJawa changed the title [WIP] Benchmarking Script for E2E Pipeline [REVIEW] Benchmarking Script for E2E Pipeline Jan 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile


# Calculate metrics from stage performance data
num_tar_files = len(results) if results else 0
num_input_documents = get_aggregated_stage_stats(results, "extract_", "num_items_processed")
Contributor


logic: Stage prefix search "extract_" won't match the actual decomposed stages from LocalArxivExtractStage. When decomposed, the stages are named like extract_arxivextractor (lowercase class name). The search prefix "extract_" will match, but if the exact naming changes or if there are multiple extract stages, this could fail silently and return 0 documents processed. Consider using a more specific or documented stage name pattern, or verify the exact stage names being generated at runtime.

Comment thread benchmarking/scripts/utils.py Outdated
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py

# ========== DOWNLOAD/PROCESSING OPTIONS ==========
download_group = p.add_argument_group("Download/Processing Options")
download_group.add_argument("--download_path", type=str, default="./arxiv_e2e_downloads")
Contributor


Super nit, but since argparse is used for the CLI, and multi-word CLI flags are conventionally kebab-case, can we make all of these kebab-case?

e.g. download_path -> download-path

Examples of multi-word CLI flags in kebab-case: ls --human-readable or grep --ignore-case.
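Illustrative only: argparse already maps kebab-case flags to snake_case attribute names, so only the flag spelling on the command line would change:

```python
import argparse

p = argparse.ArgumentParser()
download_group = p.add_argument_group("Download/Processing Options")
# Kebab-case on the command line; argparse stores it as args.download_path.
download_group.add_argument("--download-path", type=str, default="./arxiv_e2e_downloads")

args = p.parse_args(["--download-path", "/tmp/arxiv"])
print(args.download_path)  # -> /tmp/arxiv
```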

Comment thread benchmarking/nightly-benchmark.yaml
Comment thread nemo_curator/tasks/utils.py Outdated

@staticmethod
def get_aggregated_stage_stat(
    tasks: list[Task] | Mapping[str, list[Task]] | None,
Contributor


thanks for moving here

nit: can this be list[Task] | WorkflowResult, and then we can check if isinstance(x, WorkflowResult): x.whatever
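A sketch of the suggested union, with stand-in classes since Task and WorkflowResult live elsewhere in the codebase (the tasks attribute name is an assumption):

```python
from dataclasses import dataclass, field


@dataclass
class Task:  # stand-in for the real Task type
    stats: dict = field(default_factory=dict)


@dataclass
class WorkflowResult:  # stand-in for the real WorkflowResult type
    tasks: list = field(default_factory=list)


def normalize_tasks(tasks):
    """Accept list[Task], a WorkflowResult, or None; always return list[Task]."""
    if isinstance(tasks, WorkflowResult):
        return tasks.tasks
    return tasks or []
```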

Comment thread benchmarking/nightly-benchmark.yaml Outdated
      - num_output_documents
      - num_input_documents
  ray:
    num_cpus: 64
Contributor

@praateekmahajan praateekmahajan Jan 23, 2026


Let's reduce the number of CPUs to maybe 16, as discussed offline, since there are only 47 files in the offline setting. I'd guess we'll have to change the timeout too.

    formats:
    - type: "files"
      path: "{datasets_path}/mscoco/model_weights"
  - name: "arxiv_downloads"
Contributor


Up to you if you think additional clarity would be helpful, but it would be nice to document explicitly (or internally) which batches of arxiv are used, something like pdf/arXiv_pdf_1001_001 to pdf/arXiv_pdf_1001_002, or maybe it's all of pdf/arXiv_pdf_1001_* (if it is, then maybe we can rename it to arxiv_downloads_1001 or arxiv_downloads_1001_to_1002).

Contributor Author


Added the path for now; let's get this in and clean it up later on.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Comment on lines +50 to +53
  - name: "fasttext_model"
    formats:
    - type: "bin"
      path: "{model_weights_path}/fasttext/lid.176.bin"
Contributor


For the image and video models we do it like:

  - name: "mscoco_model_weights"
    formats:
    - type: "files"
      path: "{datasets_path}/mscoco/model_weights"

and

  - name: "videos_model_weights"
    formats:
    - type: "files"
      path: "{datasets_path}/video_model_weights"

so do you think we should just do path: "{datasets_path}/fasttext/lid.176.bin" to match?

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Contributor

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py
Contributor

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@@ -0,0 +1,500 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year should be 2025, not 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
