[benchmark] Add Video Benchmarks by suiyoubi · Pull Request #1430 · NVIDIA-NeMo/Curator

suiyoubi · 2026-01-26T15:11:16Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot · 2026-01-26T15:11:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-01-26T15:13:49Z

Greptile Overview

Greptile Summary

This PR adds comprehensive video processing benchmarks to the nightly benchmark suite. The implementation properly reuses existing tutorial code by extracting the argparser and pipeline creation functions into reusable components.

Key changes:

- Created video_pipeline_benchmark.py that reuses the video splitting pipeline from tutorials with proper metrics collection (videos processed, clips generated, throughput)
- Added 4 benchmark configurations testing different video processing scenarios: embedding generation, transcoding, captioning with enhancement, and TransNetV2 with motion/aesthetic filtering
- Refactored video_split_clip_example.py to extract create_video_splitting_argparser() for reuse across scripts
- Renamed argument from --output-clip-path to --output-path and updated all README examples accordingly
- Added defensive checks for task.data attributes to prevent AttributeError exceptions
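
The parser reuse described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the body of `create_video_splitting_argparser()` is a stand-in that contains only the three arguments visible in the review diff, `build_benchmark_parser` is a hypothetical helper name, and the `--executor` choices and default are assumptions.

```python
import argparse


def create_video_splitting_argparser() -> argparse.ArgumentParser:
    """Stand-in for the tutorial's parser factory (only arguments shown in the diff)."""
    parser = argparse.ArgumentParser(description="Video splitting example")
    parser.add_argument("--video-limit", type=int, default=None, help="Limit the number of videos to read")
    parser.add_argument("--verbose", action="store_true", default=False)
    parser.add_argument("--output-path", type=str, help="Path to output clips", required=True)
    return parser


def build_benchmark_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: extend the tutorial parser with benchmark-only flags."""
    parser = create_video_splitting_argparser()
    parser.add_argument("--benchmark-results-path", type=str, required=True)
    # The choices and default below are assumptions made for illustration.
    parser.add_argument("--executor", type=str, default="xenna", choices=["xenna", "ray_data"])
    return parser


args = build_benchmark_parser().parse_args(
    ["--output-path", "/tmp/clips", "--benchmark-results-path", "/tmp/results"]
)
print(args.output_path, args.benchmark_results_path, args.executor)
```

Because the benchmark only appends arguments to the parser it receives, any flag added to the tutorial later would become available to the benchmark without further changes.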

Minor issue:

- One benchmark requirement has a placeholder value that needs updating after actual benchmarking (benchmarking/nightly-benchmark.yaml, line 532)

Confidence Score: 4/5

- This PR is safe to merge with minimal risk.
- The implementation follows established benchmark patterns, properly reuses existing code, includes defensive error handling, and addresses previous review comments. The only concern is a placeholder value that should be updated after benchmarking.
- No files require special attention. The placeholder value in benchmarking/nightly-benchmark.yaml:532 can be updated in a follow-up after benchmarking is complete.

Important Files Changed

| Filename | Overview |
|---|---|
| benchmarking/scripts/video_pipeline_benchmark.py | new benchmark script that reuses video pipeline from tutorials with proper error handling and metrics collection |
| benchmarking/nightly-benchmark.yaml | adds video dataset configs and 4 benchmark entries (embedding, transcoding, captioning, transnetv2) |
| tutorials/video/getting-started/video_split_clip_example.py | refactored to extract argparser function for reuse, renamed `--output-clip-path` to `--output-path` |
| tutorials/video/getting-started/README.md | updated all examples to use `--output-path` instead of deprecated `--output-clip-path` |

Sequence Diagram

sequenceDiagram
    participant User
    participant Benchmark as video_pipeline_benchmark.py
    participant Utils as utils.py
    participant Tutorial as video_split_clip_example.py
    participant Pipeline as Video Pipeline
    participant Executor as Xenna/RayData Executor

    User->>Benchmark: Run benchmark with args
    Benchmark->>Tutorial: create_video_splitting_argparser()
    Tutorial-->>Benchmark: ArgumentParser
    Benchmark->>Benchmark: Add benchmark args (--benchmark-results-path, --executor)
    Benchmark->>Benchmark: parse_args()
    
    Benchmark->>Utils: setup_executor(args.executor)
    Utils-->>Benchmark: Executor instance
    
    Benchmark->>Tutorial: create_video_splitting_pipeline(args)
    Tutorial->>Pipeline: Create Pipeline("video_splitting")
    Tutorial->>Pipeline: Add VideoReader stage
    Tutorial->>Pipeline: Add splitting stage (FixedStride/TransNetV2)
    Tutorial->>Pipeline: Add ClipTranscodingStage
    
    alt Generate Embeddings
        Tutorial->>Pipeline: Add embedding stage (CosmosEmbed1/InternVideo2)
    end
    
    alt Generate Captions
        Tutorial->>Pipeline: Add VideoFrameCaptioningStage
        alt Enhance Captions
            Tutorial->>Pipeline: Add LLMCaptionImprovementStage
        end
    end
    
    alt Motion/Aesthetic Filtering
        Tutorial->>Pipeline: Add VideoMotionFilterStage
        Tutorial->>Pipeline: Add VideoAestheticFilterStage
    end
    
    Tutorial->>Pipeline: Add ClipWriterStage
    Tutorial-->>Benchmark: Pipeline object
    
    Benchmark->>Pipeline: pipeline.run(executor)
    Pipeline->>Executor: Process video tasks
    Executor-->>Pipeline: output_tasks
    Pipeline-->>Benchmark: output_tasks
    
    Benchmark->>Benchmark: Calculate metrics (videos processed, clips generated, throughput)
    Benchmark->>Utils: write_benchmark_results(results, path)
    Utils->>Utils: Write params.json, metrics.json, tasks.pkl
    Utils-->>Benchmark: Success
    
    Benchmark-->>User: Exit code (0=success, 1=failure)
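
The last steps of the diagram (computing metrics, writing the three result files, returning an exit code) might look something like the sketch below. Only the function name `write_benchmark_results(results, path)`, the file names, and the metric names `num_videos_processed` and `num_clips_generated` come from this PR page. The layout of the `results` dict, the `summarize` helper, and the throughput definition (videos per second) are assumptions.

```python
import json
import pickle
import tempfile
from pathlib import Path


def write_benchmark_results(results: dict, path: str) -> None:
    """Assumed behavior: write params.json, metrics.json and tasks.pkl under `path`."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "params.json").write_text(json.dumps(results["params"], indent=2))
    (out_dir / "metrics.json").write_text(json.dumps(results["metrics"], indent=2))
    with open(out_dir / "tasks.pkl", "wb") as f:
        pickle.dump(results["tasks"], f)


def summarize(num_videos: int, num_clips: int, elapsed_s: float) -> dict:
    """Hypothetical helper: build the metrics dict (throughput unit is an assumption)."""
    throughput = num_videos / elapsed_s if elapsed_s > 0 else 0.0
    return {
        "num_videos_processed": num_videos,
        "num_clips_generated": num_clips,
        "throughput_videos_per_sec": throughput,
    }


metrics = summarize(num_videos=10, num_clips=300, elapsed_s=120.0)
results = {"params": {"executor": "xenna"}, "metrics": metrics, "tasks": []}
results_dir = tempfile.mkdtemp()
write_benchmark_results(results, results_dir)

# Mirror the diagram's exit-code convention: 0 on success, 1 on failure.
exit_code = 0 if (Path(results_dir) / "metrics.json").exists() else 1
```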

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-26T15:13:53Z

+results_path: /raid/aot/output/curator_benchmark
+datasets_path: /raid/aot/datasets


Check that these paths (/raid/aot/...) are appropriate for the shared benchmark configuration, as they appear specific to a local development environment.

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps

2 files reviewed, 2 comments

greptile-apps · 2026-01-26T17:47:54Z

+
+        # Calculate metrics from output tasks
+        # Count unique videos by their input_video path
+        unique_videos = {task.data.input_video for task in output_tasks if task.data and task.data.input_video}


Potential AttributeError if task.data is None or doesn't have input_video attribute

Suggested change

-        unique_videos = {task.data.input_video for task in output_tasks if task.data and task.data.input_video}
+        unique_videos = {task.data.input_video for task in output_tasks if task.data and hasattr(task.data, 'input_video') and task.data.input_video}

greptile-apps · 2026-01-26T17:47:55Z

+        # Count unique videos by their input_video path
+        unique_videos = {task.data.input_video for task in output_tasks if task.data and task.data.input_video}
+        num_videos_processed = len(unique_videos)
+        num_clips_generated = sum(len(task.data.clips) for task in output_tasks if task.data and task.data.clips)


Same defensive check needed here for clips attribute

Suggested change

-        num_clips_generated = sum(len(task.data.clips) for task in output_tasks if task.data and task.data.clips)
+        num_clips_generated = sum(len(task.data.clips) for task in output_tasks if task.data and hasattr(task.data, 'clips') and task.data.clips)
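
For illustration, the same defensive counting can also be written with `getattr` and a default, which folds the `hasattr` check and the attribute read into one call. This is only a sketch of the pattern the review asks for: `count_videos_and_clips` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the real task objects, whose types are not shown on this page.

```python
from types import SimpleNamespace


def count_videos_and_clips(output_tasks):
    """Count unique input videos and total clips, tolerating missing data or attributes."""
    unique_videos = set()
    num_clips = 0
    for task in output_tasks:
        data = getattr(task, "data", None)
        if data is None:
            continue
        input_video = getattr(data, "input_video", None)
        if input_video:
            unique_videos.add(input_video)
        # `or []` covers both a missing `clips` attribute and `clips=None`.
        num_clips += len(getattr(data, "clips", None) or [])
    return len(unique_videos), num_clips


tasks = [
    SimpleNamespace(data=SimpleNamespace(input_video="a.mp4", clips=["c1", "c2"])),
    SimpleNamespace(data=SimpleNamespace(input_video="a.mp4", clips=["c3"])),
    SimpleNamespace(data=SimpleNamespace(input_video="b.mp4")),  # no clips attribute
    SimpleNamespace(data=None),  # no data at all
]
num_videos_processed, num_clips_generated = count_videos_and_clips(tasks)
print(num_videos_processed, num_clips_generated)  # prints "2 3"
```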

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps

No files reviewed, no comments

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps

No files reviewed, no comments

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-27T18:03:41Z

+        unique_videos = {task.data.input_video for task in output_tasks if task.data and task.data.input_video}
+        num_videos_processed = len(unique_videos)
+        num_clips_generated = sum(len(task.data.clips) for task in output_tasks if task.data and task.data.clips)


Need defensive checks for task.data, task.data.input_video, and task.data.clips to handle potential None values or missing attributes more robustly.

Suggested change

-        unique_videos = {task.data.input_video for task in output_tasks if task.data and task.data.input_video}
+        unique_videos = {task.data.input_video for task in output_tasks if task.data and hasattr(task.data, 'input_video') and task.data.input_video}
         num_videos_processed = len(unique_videos)
-        num_clips_generated = sum(len(task.data.clips) for task in output_tasks if task.data and task.data.clips)
+        num_clips_generated = sum(len(task.data.clips) for task in output_tasks if task.data and hasattr(task.data, 'clips') and task.data.clips)

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-27T21:39:03Z

    parser.add_argument("--video-limit", type=int, default=None, help="Limit the number of videos to read")
    parser.add_argument("--verbose", action="store_true", default=False)
-    parser.add_argument("--output-clip-path", type=str, help="Path to output clips", required=True)
+    parser.add_argument("--output-path", type=str, help="Path to output clips", required=True)


The argument was renamed from --output-clip-path to --output-path, but README.md in this directory still uses the old name in all examples (lines 20, 36, 47, 80). Update the README to use --output-path instead.
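
A stale flag like this can be caught mechanically. The sketch below scans a block of text for the old option name and reports 1-based line numbers. `find_stale_flag` is a hypothetical helper, and the sample README text is made up for the example rather than copied from the tutorial.

```python
def find_stale_flag(text: str, old_flag: str = "--output-clip-path") -> list[int]:
    """Return the 1-based line numbers in `text` that still mention `old_flag`."""
    return [n for n, line in enumerate(text.splitlines(), start=1) if old_flag in line]


# Made-up README excerpt: one command still uses the old flag name.
readme_text = "\n".join(
    [
        "# Basic run",
        "python video_split_clip_example.py --output-clip-path ./clips",
        "# Limited run",
        "python video_split_clip_example.py --video-limit 10 --output-path ./clips",
    ]
)
stale_lines = find_stale_flag(readme_text)
print(stale_lines)  # prints "[2]"
```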

praateekmahajan · 2026-01-27T22:18:15Z

+    timeout_s: 1800
+    ray:
+      num_cpus: 64
+      num_gpus: 1


num_gpus = 4

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-27T22:49:03Z

+    requirements:
+      # ensure the total number of documents processed is correct
+      - metric: num_clips_generated
+        exact_value: 300 # TODO: update this value after benchmarking


placeholder value (300) needs updating after actual benchmarking
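
To show why the placeholder matters, here is a rough sketch of how an `exact_value` requirement could be checked against the collected metrics. The `metric` and `exact_value` keys come from the YAML in the diff. `check_requirements` is a hypothetical function and does not reflect how the nightly benchmark harness evaluates requirements.

```python
def check_requirements(metrics: dict, requirements: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means every requirement passed."""
    failures = []
    for req in requirements:
        name = req["metric"]
        if name not in metrics:
            failures.append(f"{name}: metric missing from results")
        elif "exact_value" in req and metrics[name] != req["exact_value"]:
            failures.append(f"{name}: expected {req['exact_value']}, got {metrics[name]}")
    return failures


requirements = [{"metric": "num_clips_generated", "exact_value": 300}]
passing = check_requirements({"num_clips_generated": 300}, requirements)
failing = check_requirements({"num_clips_generated": 287}, requirements)
print(passing)
print(failing)
```

Under this reading, a run that generates any clip count other than the placeholder would be reported as a failure, so the value has to match a real measured run.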

suiyoubi added 13 commits January 22, 2026 06:49

Add benchmark script

a753a11

Signed-off-by: Ao Tang <aot@nvidia.com>

test yaml

41e5a11

Signed-off-by: Ao Tang <aot@nvidia.com>

enable gpu decoding

809d4c9

Signed-off-by: Ao Tang <aot@nvidia.com>

Add embedding stage

2d4319d

Signed-off-by: Ao Tang <aot@nvidia.com>

run embedding from yaml

f326a21

Signed-off-by: Ao Tang <aot@nvidia.com>

use tutorial pipeline

73aa0c1

Signed-off-by: Ao Tang <aot@nvidia.com>

adapt tutoiral

075df6b

Signed-off-by: Ao Tang <aot@nvidia.com>

benchmark fix

f153fcf

Signed-off-by: Ao Tang <aot@nvidia.com>

Merge branch 'main' of github.com:NVIDIA-NeMo/Curator into aot/video-…

cb70bb3

…benchmark

update path

f306b55

Signed-off-by: Ao Tang <aot@nvidia.com>

captioning benchmark

16a9bfd

Signed-off-by: Ao Tang <aot@nvidia.com>

Add transnetv2 benchmark

2c6c336

Signed-off-by: Ao Tang <aot@nvidia.com>

fix ruff

7e836bd

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed Jan 26, 2026

suiyoubi added 3 commits January 26, 2026 09:43

revert

fb59c58

Signed-off-by: Ao Tang <aot@nvidia.com>

indent

44108ce

Signed-off-by: Ao Tang <aot@nvidia.com>

fix empty sinks

f6f8c64

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed Jan 26, 2026

coverage for all stages

a2b3867

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed Jan 27, 2026

add model_path

aa72152

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed Jan 27, 2026

add throughput requirement

6640cde

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed Jan 27, 2026

benchmark value revised

b00a631

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed Jan 27, 2026

github-actions Bot added the community-request label Jan 27, 2026

praateekmahajan reviewed Jan 27, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci January 27, 2026 22:47 Inactive

greptile-apps Bot reviewed Jan 27, 2026

praateekmahajan merged commit c022a7e into main Jan 27, 2026
50 checks passed

sarahyurick mentioned this pull request Feb 11, 2026

Add relevant 26.02 docs to r1.1.0 #1493

Merged

44 tasks
