
[REVIEW] Benchmarking Script for E2E Pipeline #1389

Merged
VibhuJawa merged 19 commits into NVIDIA-NeMo:main from VibhuJawa:vjawa/add_benchmarking_scripts
Feb 4, 2026

Conversation

@VibhuJawa
Contributor

This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.

Configuration for local benchmarking:

  • Added benchmarking/local-cc-e2e.yaml to define local paths for results, datasets, and models, as well as global settings like default timeout and scratch deletion.
  • Configured a single pipeline entry (cc_e2e_pipeline_local) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jan 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment thread benchmarking/local-cc-e2e.yaml Outdated
Comment on lines +18 to +35
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
Contributor


  • Instead of a new file, can you add it to nightly and then run with --entries cc_e2e_pipeline_local?
  • We should probably have more CPUs to test benchmarking.
  • For url_limit = 1, we're only downloading one file, and if the download takes 10 minutes the other N-1 CPUs would just sit idle. So we should probably set url_limit = K * num_cpus to test parallelism and load balancing.
  • For record_limit, wdyt of removing it? We know there is a backpressure issue in Ray Data when the stages are actors, so we can test for that later. IIRC when record_limit is higher, iterate is slower than download. I'm hoping that when we combine iterate and extract in the future, we'll see perf benefits.

For fasttext-model-path and hf-home, we could probably use something under the datasets path itself? wdyt?
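Not a real entry from the repo, but the bullets above might translate into a config sketch like this (entry name and values are illustrative, assuming a hypothetical K=4 files per CPU; record_limit dropped per the backpressure concern):

```yaml
- name: cc_e2e_pipeline_nightly
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --url_limit=64          # K * num_cpus (4 * 16) so downloads keep every CPU busy
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 16            # more CPUs to actually exercise parallelism
```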

Contributor Author


Addressed the review:

  • Added something like arxiv_e2e_pipeline_*
  • Moved to a locally downloaded tar dump (to simulate the workflow better)
  • We now run on 45 downloaded tar files; we can increase this in a follow-up PR for larger-scale testing as needed
  • Removed record_limit
  • Added paths, let me know if that works.

Comment thread benchmarking/scripts/cc_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/cc_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/local-cc-e2e.yaml Outdated
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
Contributor


Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (exact_value), and then probably something for throughput (min_value)?
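A hypothetical sketch of how such checks could look in a benchmark entry (the exact_value/min_value key names follow the comment above; the metric names and numbers are made up):

```yaml
metrics:
  - name: num_input_documents
    exact_value: 100000        # row count entering the pipeline
  - name: num_output_documents
    exact_value: 73250         # row count surviving the filters
  - name: throughput_docs_per_s
    min_value: 500             # flag a regression if throughput drops below this
```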

Contributor Author

@VibhuJawa VibhuJawa Jan 22, 2026


Added more metrics like num_tar_files, num_input_documents, and num_output_documents. I have not added throughput as a requirement for now. Please take a look (I have not run this on the benchmarking machine).

Comment thread benchmarking/scripts/cc_e2e_pipeline_benchmark.py Outdated
VibhuJawa and others added 4 commits January 21, 2026 14:37
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.

Changes:

  • Added get_aggregated_stage_stats utility function to benchmarking/scripts/utils.py for extracting aggregated performance metrics from pipeline results
  • Created arxiv_e2e_pipeline_benchmark.py with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
  • Added two new dataset configurations (arxiv_downloads and fasttext_model) and two benchmark entries (arxiv_e2e_pipeline_raydata and arxiv_e2e_pipeline_xenna) to nightly-benchmark.yaml for testing both Ray Data and Xenna executors
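As a rough sketch of what a helper like get_aggregated_stage_stats might do (the real signature and result layout are defined in benchmarking/scripts/utils.py; the per-result dict shape below is an assumption):

```python
def get_aggregated_stage_stats(results, stage_prefix, stat_key):
    """Sum `stat_key` across all stages whose name starts with `stage_prefix`.

    Assumes each result maps stage name -> stats dict, e.g.
    {"extract_arxivextractor": {"num_items_processed": 42}}.
    """
    total = 0
    for result in results or []:
        for stage_name, stats in result.items():
            if stage_name.startswith(stage_prefix):
                total += stats.get(stat_key, 0)
    return total


# Example aggregation over two per-file results.
results = [
    {"extract_arxiv": {"num_items_processed": 2}},
    {"extract_arxiv": {"num_items_processed": 3}, "filter_len": {"num_items_processed": 9}},
]
print(get_aggregated_stage_stats(results, "extract_", "num_items_processed"))  # 5
```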

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
benchmarking/scripts/utils.py Adds helper function for aggregating stage performance statistics by matching stage name prefixes
benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Complete E2E benchmark script with custom stages for local tar file processing, comprehensive filtering pipeline, and configurable execution modes
benchmarking/nightly-benchmark.yaml Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa requested a review from Copilot January 22, 2026 21:47
@VibhuJawa VibhuJawa changed the title [WIP] Benchmarking Script for E2E Pipeline [REVIEW] Benchmarking Script for E2E Pipeline Jan 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Outdated
Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile


# Calculate metrics from stage performance data
num_tar_files = len(results) if results else 0
num_input_documents = get_aggregated_stage_stats(results, "extract_", "num_items_processed")
Contributor


logic: Stage prefix search "extract_" won't match the actual decomposed stages from LocalArxivExtractStage. When decomposed, the stages are named like extract_arxivextractor (lowercase class name). The search prefix "extract_" will match, but if the exact naming changes or if there are multiple extract stages, this could fail silently and return 0 documents processed. Consider using a more specific or documented stage name pattern, or verify the exact stage names being generated at runtime.

Comment thread benchmarking/scripts/utils.py Outdated
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py

# ========== DOWNLOAD/PROCESSING OPTIONS ==========
download_group = p.add_argument_group("Download/Processing Options")
download_group.add_argument("--download_path", type=str, default="./arxiv_e2e_downloads")
Contributor


Super nit, but since argparse is used for the CLI, and multi-word CLI flags are conventionally kebab-case, can we make all of these kebab-case?

e.g. download_path -> download-path

Examples of multi-word CLI flags in kebab-case: ls --human-readable or grep --ignore-case.
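Illustrative only: argparse already maps kebab-case flags to snake_case attribute names, so only the flag spelling on the command line would change:

```python
import argparse

p = argparse.ArgumentParser()
download_group = p.add_argument_group("Download/Processing Options")
# Kebab-case on the command line; argparse stores it as args.download_path.
download_group.add_argument("--download-path", type=str, default="./arxiv_e2e_downloads")

args = p.parse_args(["--download-path", "/tmp/arxiv"])
print(args.download_path)  # -> /tmp/arxiv
```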

Comment thread benchmarking/nightly-benchmark.yaml
Comment thread nemo_curator/tasks/utils.py Outdated

@staticmethod
def get_aggregated_stage_stat(
    tasks: list[Task] | Mapping[str, list[Task]] | None,
Contributor


thanks for moving here

nit: can this be list[Task] | WorkflowResult, and then we can check if isinstance(x, WorkflowResult): x.whatever
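A sketch of the suggested union, with stand-in classes since Task and WorkflowResult live elsewhere in the codebase (the tasks attribute name is an assumption):

```python
from dataclasses import dataclass, field


@dataclass
class Task:  # stand-in for the real Task type
    stats: dict = field(default_factory=dict)


@dataclass
class WorkflowResult:  # stand-in for the real WorkflowResult type
    tasks: list = field(default_factory=list)


def normalize_tasks(tasks):
    """Accept list[Task], a WorkflowResult, or None; always return list[Task]."""
    if isinstance(tasks, WorkflowResult):
        return tasks.tasks
    return tasks or []
```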

Comment thread benchmarking/nightly-benchmark.yaml Outdated
      - num_output_documents
      - num_input_documents
  ray:
    num_cpus: 64
Contributor

@praateekmahajan praateekmahajan Jan 23, 2026


Let's reduce the number of CPUs to maybe 16, as discussed offline, since there are only 47 files in the offline setting. I'd guess we'll have to change the timeout too.

    formats:
    - type: "files"
      path: "{datasets_path}/mscoco/model_weights"
  - name: "arxiv_downloads"
Contributor


Up to you if you think additional clarity would be helpful, but it would be nice to document explicitly (or internally) which batches of arxiv are used, something like pdf/arXiv_pdf_1001_001 to pdf/arXiv_pdf_1001_002, or maybe it's all of pdf/arXiv_pdf_1001_* (if it is, then maybe we can rename it to arxiv_downloads_1001 or arxiv_downloads_1001_to_1002).

Contributor Author


Added the path for now; let's get this in and clean it up later on.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Comment on lines +50 to +53
  - name: "fasttext_model"
    formats:
    - type: "bin"
      path: "{model_weights_path}/fasttext/lid.176.bin"
Contributor


For the image and video models we do it like:

  - name: "mscoco_model_weights"
    formats:
    - type: "files"
      path: "{datasets_path}/mscoco/model_weights"

and

  - name: "videos_model_weights"
    formats:
    - type: "files"
      path: "{datasets_path}/video_model_weights"

so do you think we should just do path: "{datasets_path}/fasttext/lid.176.bin" to match?

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Contributor

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py
Contributor

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@@ -0,0 +1,500 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year should be 2025, not 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
