[REVIEW] Benchmarking Script for E2E Pipeline #1389
VibhuJawa merged 19 commits into NVIDIA-NeMo:main
Conversation
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
```yaml
- name: cc_e2e_pipeline_local
  enabled: true
  script: cc_e2e_pipeline_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --fasttext_model_path=/raid/vjawa/models/lid.176.bin
    --download_path={session_entry_dir}/scratch/downloads
    --output_path={session_entry_dir}/scratch/output
    --snapshot=2024-30
    --url_limit=1
    --record_limit=100
    --executor=ray_data
  timeout_s: 3600
  ray:
    num_cpus: 4
    num_gpus: 0
    enable_object_spilling: false
    object_store_size_bytes: 11474836480
```
- Instead of a new file, can you add it to nightly and then run with `--entries cc_e2e_pipeline_local`? We probably should have more CPUs to test benchmarking.
- For `url_limit = 1`, that's only downloading one file.. and if the download takes 10 minutes, the other N-1 CPUs would just be idle.. so we should probably have `url_limit = K * num_cpus` to test parallelism and load balancing (see the sketch below).
- For `record_limit`, wdyt of removing it? Because we know there is a backpressure issue in Ray Data when the stages are actors, so we can test for that later.. and IIRC when `record_limit` is higher, iterate is slower than download.. I'm hoping that in the future, when we combine iterate and extract, we'll probably see perf benefits.
- For `fasttext-model-path` and `hf-home`, we could probably use something in the datasets path itself? wdyt?
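As a concrete reading of the `url_limit = K * num_cpus` suggestion, here is a minimal sketch; the value of `K` and the use of `os.cpu_count()` are illustrative assumptions, not part of the PR:

```python
import os

K = 4  # oversubscription factor per CPU; illustrative assumption
num_cpus = os.cpu_count() or 1
# With url_limit == 1, one worker downloads while the other N-1 CPUs idle;
# scaling the limit with the CPU count keeps all download workers busy.
url_limit = K * num_cpus
```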
Addressed the review:
- Added something like `arxiv_e2e_pipeline_*`
- Moved to a local downloaded tar dump (to simulate the workflow better)
- We run on 45 downloaded tar files now; can increase in a follow-up PR for larger-scale testing as needed
- Removed `record_limit`
- Added paths, let me know if that works.
```yaml
num_cpus: 4
num_gpus: 0
enable_object_spilling: false
object_store_size_bytes: 11474836480
```
Can we add more metrics to track, such as the number of rows at the beginning and the number of rows at the end (`exact_value`), and then probably something for throughput (`min_value`)?
Added more metrics: `num_tar_files`, `num_input_documents`, and `num_output_documents`. For now, I have not added throughput as a requirement. Please take a look (I have not run this on the benchmarking machine yet).
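To illustrate the kind of checks being discussed, here is a minimal sketch of exact-count and minimum-throughput constraints; the schema, the metric values, and the `check_metrics` helper are hypothetical placeholders, not the harness's actual API or measured results:

```python
# Placeholder numbers, not measured results.
metrics = {"num_tar_files": 45, "num_input_documents": 100, "num_output_documents": 80}
constraints = {
    "num_tar_files": {"exact_value": 45},
    "docs_per_second": {"min_value": 10.0},  # throughput floor; not yet in the PR
}

def check_metrics(metrics: dict, constraints: dict) -> list[str]:
    """Return a list of human-readable failures; empty means all checks pass."""
    failures = []
    for name, rule in constraints.items():
        value = metrics.get(name)
        if "exact_value" in rule and value != rule["exact_value"]:
            failures.append(f"{name}: expected {rule['exact_value']}, got {value}")
        if "min_value" in rule and (value is None or value < rule["min_value"]):
            failures.append(f"{name}: below min {rule['min_value']}, got {value}")
    return failures
```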
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Pull request overview
This pull request adds a comprehensive end-to-end benchmarking script for the ArXiv text processing pipeline, designed for nightly benchmarking with support for both local tar file processing and S3 downloading modes.
Changes:
- Added `get_aggregated_stage_stats` utility function to `benchmarking/scripts/utils.py` for extracting aggregated performance metrics from pipeline results
- Created `arxiv_e2e_pipeline_benchmark.py` with a full E2E ArXiv processing pipeline including extraction, heuristic filters, quality classifiers, and configurable output formats
- Added two new dataset configurations (`arxiv_downloads` and `fasttext_model`) and two benchmark entries (`arxiv_e2e_pipeline_raydata` and `arxiv_e2e_pipeline_xenna`) to `nightly-benchmark.yaml` for testing both Ray Data and Xenna executors
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| benchmarking/scripts/utils.py | Adds helper function for aggregating stage performance statistics by matching stage name prefixes |
| benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py | Complete E2E benchmark script with custom stages for local tar file processing, comprehensive filtering pipeline, and configurable execution modes |
| benchmarking/nightly-benchmark.yaml | Adds dataset configurations and two benchmark entries for testing the ArXiv pipeline with different executors |
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
```python
# Calculate metrics from stage performance data
num_tar_files = len(results) if results else 0
num_input_documents = get_aggregated_stage_stats(results, "extract_", "num_items_processed")
```
logic: Stage prefix search "extract_" won't match the actual decomposed stages from LocalArxivExtractStage. When decomposed, the stages are named like extract_arxivextractor (lowercase class name). The search prefix "extract_" will match, but if the exact naming changes or if there are multiple extract stages, this could fail silently and return 0 documents processed. Consider using a more specific or documented stage name pattern, or verify the exact stage names being generated at runtime.
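For context, a prefix-matched aggregation like the one called above might look roughly as follows; this is a hedged reconstruction from the call site, not the PR's actual code, and the `stage_perf` attribute shape is a guess:

```python
def get_aggregated_stage_stats(results, stage_prefix: str, stat_name: str) -> int:
    """Sum stat_name over all stages whose name starts with stage_prefix."""
    total = 0
    for task in results or []:
        # Treating stage_perf as a {stage_name: {stat_name: value}} mapping
        # is an assumption for illustration only.
        for stage_name, stats in getattr(task, "stage_perf", {}).items():
            if stage_name.startswith(stage_prefix):
                total += stats.get(stat_name, 0)
    # The failure mode flagged above: if no stage name matches the prefix
    # (e.g. after a rename), this silently returns 0 instead of raising.
    return total
```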
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
```python
# ========== DOWNLOAD/PROCESSING OPTIONS ==========
download_group = p.add_argument_group("Download/Processing Options")
download_group.add_argument("--download_path", type=str, default="./arxiv_e2e_downloads")
```
Super nit, but since argparse is used for the CLI, and multi-word CLI flags are conventionally kebab-case, can we make all of these kebab-case?
e.g. `download_path` -> `download-path`
Examples of multi-word CLI flags in kebab-case: `ls --human-readable` or `grep --ignore-case`.
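Worth noting that argparse makes this rename cheap: hyphens in long option names are converted to underscores when inferring the attribute name, so only the flag spelling changes. A minimal sketch:

```python
import argparse

p = argparse.ArgumentParser()
download_group = p.add_argument_group("Download/Processing Options")
# Kebab-case flag: argparse converts the hyphen to an underscore for the
# attribute name, so the rest of the script still reads args.download_path.
download_group.add_argument("--download-path", type=str, default="./arxiv_e2e_downloads")

args = p.parse_args(["--download-path", "/tmp/arxiv"])
print(args.download_path)  # -> /tmp/arxiv
```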
```python
@staticmethod
def get_aggregated_stage_stat(
    tasks: list[Task] | Mapping[str, list[Task]] | None,
```
Thanks for moving this here.
Nit: can this be `list[Task] | WorkflowResult`, so we can check `if isinstance(x, WorkflowResult): x.whatever`?
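A minimal sketch of the suggested signature, assuming `Task` and `WorkflowResult` are the codebase's own types; the `tasks` attribute used to unwrap the result is a guess:

```python
def get_aggregated_stage_stat(
    tasks: "list[Task] | WorkflowResult | None",
    stage_prefix: str,
    stat_name: str,
) -> float:
    # Normalize: unwrap a WorkflowResult into its task list before aggregating.
    if isinstance(tasks, WorkflowResult):
        tasks = tasks.tasks  # attribute name is an assumption
    total = 0.0
    for task in tasks or []:
        ...  # aggregate stat_name across matching stages, as before
    return total
```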
```yaml
- num_output_documents
- num_input_documents
ray:
  num_cpus: 64
```
Let's reduce the number of CPUs to maybe 16, as discussed offline, since there are only 47 files in the offline setting.. I'd guess we'll have to change the timeout too.
```yaml
  formats:
    - type: "files"
      path: "{datasets_path}/mscoco/model_weights"
- name: "arxiv_downloads"
```
Up to you if you think additional clarity would be helpful, but it would be nice to document explicitly (or internally) which batches of arXiv are used: something like `pdf/arXiv_pdf_1001_001` to `pdf/arXiv_pdf_1001_002`, or maybe it's all of `pdf/arXiv_pdf_1001_*` (if it is, then maybe we can rename it to `arxiv_downloads_1001` or `arxiv_downloads_1001_to_1002`).
Added the path for now; let's get this in and clean it up later.
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
| - name: "fasttext_model" | ||
| formats: | ||
| - type: "bin" | ||
| path: "{model_weights_path}/fasttext/lid.176.bin" |
For the image and video models we do it like:

```yaml
- name: "mscoco_model_weights"
  formats:
    - type: "files"
      path: "{datasets_path}/mscoco/model_weights"
```

and

```yaml
- name: "videos_model_weights"
  formats:
    - type: "files"
      path: "{datasets_path}/video_model_weights"
```

So do you think we should just do `path: "{datasets_path}/fasttext/lid.176.bin"` to match?
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
```diff
@@ -0,0 +1,500 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
Copyright year should be 2025, not 2026:

```diff
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
This pull request adds a new configuration file for running local end-to-end benchmarking of the Common Crawl pipeline. The configuration includes local paths, global settings, and a single enabled entry for executing the pipeline with specific arguments and resource allocations.
Configuration for local benchmarking:
- Added `benchmarking/local-cc-e2e.yaml` to define local paths for results, datasets, and models, as well as global settings like default timeout and scratch deletion.
- Added a single enabled entry (`cc_e2e_pipeline_local`) with script arguments, resource limits (CPUs, GPUs, object store size), and Ray-specific settings for local execution.
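For reference, a config-driven runner along these lines might load the YAML and execute only the requested entries. `yaml.safe_load` is the standard PyYAML call, but the entry schema and the `run_entry` stub below are assumptions inferred from the fields visible in this PR, not the harness's actual implementation:

```python
import argparse
import yaml

def run_entry(entry: dict) -> None:
    # Stub: a real harness would launch entry["script"] with entry["args"]
    # under the entry's ray settings and timeout_s.
    print(f"would run {entry['name']}: {entry['script']}")

def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--config", default="benchmarking/local-cc-e2e.yaml")
    p.add_argument("--entries", nargs="*", help="subset of entry names to run")
    args = p.parse_args()

    with open(args.config) as f:
        config = yaml.safe_load(f)

    for entry in config.get("entries", []):
        if not entry.get("enabled", False):
            continue  # skip disabled entries
        if args.entries and entry["name"] not in args.entries:
            continue  # honor --entries filtering, as suggested in review
        run_entry(entry)

if __name__ == "__main__":
    main()
```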