feat: Timer for the data sharding and job submission #1802

guyueh1 · 2026-01-21T16:44:03Z

What does this PR do?

Wraps data sharding and job submission in a timer so that each step shows up separately in the timing output used for debugging.

For example, for the newly added qwen30b 40K seqlen GRPO example, the timing info looks like this

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

Refactor
- Added optional timing instrumentation to training operations across multiple training algorithms
- Extended policy method signatures to support performance metrics tracking during policy training and inference
- Timing context is conditionally applied without affecting existing functionality

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

coderabbitai · 2026-01-21T19:00:50Z

Walkthrough

This pull request adds optional timing instrumentation to policy training and inference methods across multiple algorithms by introducing timer and timer_tag_prefix parameters to method signatures and wrapping key operations with timer context managers.

Changes

Cohort / File(s)	Summary
Algorithm Updates `nemo_rl/algorithms/distillation.py`, `nemo_rl/algorithms/dpo.py`, `nemo_rl/algorithms/grpo.py`, `nemo_rl/algorithms/sft.py`	Added timer and timer_tag_prefix parameters to policy method calls (get_logprobs, get_reference_policy_logprobs, get_topk_logits, train) to enable granular timing measurements for inference and training operations. Changes are consistent across files with no modifications to control flow or error handling.
Interface Definitions `nemo_rl/models/policy/interfaces.py`	Extended PolicyInterface method signatures (get_logprobs, get_reference_policy_logprobs, train) to accept **kwargs for forwarding arbitrary keyword arguments including timer parameters.
Policy Implementation `nemo_rl/models/policy/lm_policy.py`	Implemented timer support in Policy class methods by adding optional Timer parameters, conditionally wrapping data sharding and futures submission operations in timer context managers using nullcontext guards. Added Timer and nullcontext imports.
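
The "conditionally wrapping ... using nullcontext guards" pattern described in the table can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SimpleTimer`, `shard_and_submit`, and the labels `sharding_data` and `submit_jobs` are hypothetical names chosen for the example, and the "job submission" step is replaced by a plain sum so the snippet runs on its own.

```python
import time
from collections import defaultdict
from contextlib import contextmanager, nullcontext
from typing import Iterator, Optional


class SimpleTimer:
    """Hypothetical stand-in for a timer class; not NeMo RL's Timer."""

    def __init__(self) -> None:
        # label -> list of elapsed seconds, one entry per timed region
        self.measurements = defaultdict(list)

    @contextmanager
    def time(self, label: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.measurements[label].append(time.perf_counter() - start)


def shard_and_submit(
    data: list,
    num_shards: int,
    timer: Optional[SimpleTimer] = None,
    timer_tag_prefix: str = "policy",
) -> list:
    def timed(label: str):
        # Fall back to a no-op context manager when no timer is supplied,
        # so callers that do not pass a timer see unchanged behavior.
        if timer is None:
            return nullcontext()
        return timer.time(f"{timer_tag_prefix}/{label}")

    with timed("sharding_data"):
        # Round-robin split of the input into num_shards pieces.
        shards = [data[i::num_shards] for i in range(num_shards)]

    with timed("submit_jobs"):
        # Placeholder for dispatching each shard to a worker.
        results = [sum(shard) for shard in shards]

    return results


timer = SimpleTimer()
shard_and_submit(list(range(10)), num_shards=2, timer=timer, timer_tag_prefix="train")
print(sorted(timer.measurements))
```

Because the guard returns `nullcontext()` when `timer` is `None`, the same function body serves both the instrumented and the uninstrumented call paths.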

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Results For Major Changes	⚠️ Warning	PR makes significant changes to core modules across multiple algorithms without documenting test results or validation data to demonstrate that timing instrumentation changes do not introduce regressions or affect algorithm behavior.	Add test results or testing information to PR description demonstrating existing unit tests pass, no performance regression occurs, convergence behavior remains unchanged, and new timing instrumentation works correctly.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding timer instrumentation to data sharding and job submission operations throughout the codebase.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_rl/models/policy/interfaces.py (1)

54-112: Update all PolicyInterface implementers to accept **kwargs in get_logprobs, get_reference_policy_logprobs, and train methods.

The interface now accepts arbitrary keyword arguments in these three methods, but no concrete implementation does. Calling one of these methods through the interface with a keyword argument that the implementation does not declare will therefore raise a TypeError.

Implementations to update:

Policy (lm_policy.py): Currently accepts explicit timer and timer_tag_prefix parameters; add **kwargs or refactor to use kwargs

MegatronPolicyWorker, DTensorPolicyWorker, DTensorPolicyWorkerV2: Add **kwargs to get_logprobs and train

BasePolicyWorker: Add **kwargs to get_reference_policy_logprobs
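
The mismatch the reviewer describes can be reproduced in isolation. The sketch below uses hypothetical classes (`PolicyLike`, `StrictWorker`, `TolerantWorker`) rather than the real NeMo RL classes, and only illustrates the general Python behavior: an abstract method that declares `**kwargs` does not make an override without `**kwargs` accept extra keyword arguments.

```python
from abc import ABC, abstractmethod
from typing import Any


class PolicyLike(ABC):
    """Hypothetical interface whose train() advertises **kwargs."""

    @abstractmethod
    def train(self, data: list, **kwargs: Any) -> float: ...


class StrictWorker(PolicyLike):
    # Override without **kwargs: extra keyword arguments are rejected.
    def train(self, data: list) -> float:
        return float(sum(data))


class TolerantWorker(PolicyLike):
    # Override with **kwargs: unknown keyword arguments are ignored.
    def train(self, data: list, **kwargs: Any) -> float:
        return float(sum(data))


def call_with_timer_kwargs(policy: PolicyLike, data: list):
    """Call train() the way a timer-aware caller might, reporting failures."""
    try:
        return policy.train(data, timer=None, timer_tag_prefix="policy_training")
    except TypeError as exc:
        return f"TypeError: {exc}"


print(call_with_timer_kwargs(TolerantWorker(), [1.0, 2.0]))
print(call_with_timer_kwargs(StrictWorker(), [1.0, 2.0]))
```

Whether this failure can occur in the PR itself depends on which call sites actually forward `timer` and `timer_tag_prefix` to which implementations, which the conversation does not show.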

guyueh1 · 2026-01-22T04:04:40Z

@youngeunkwon0405 please review

youngeunkwon0405

LGTM

guyueh1 added 3 commits January 20, 2026 08:30

Add timer to data sharding and job submission in policy

2c63e4a

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Merge branch 'main' into time_data_sharding_and_job_submission

1531c50

Add more timers

9b1c749

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 requested review from a team as code owners January 21, 2026 16:44

guyueh1 self-assigned this Jan 21, 2026

guyueh1 added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Jan 21, 2026

guyueh1 temporarily deployed to nemo-ci January 21, 2026 16:44 — with GitHub Actions Inactive

guyueh1 had a problem deploying to nemo-ci January 21, 2026 16:48 — with GitHub Actions Failure

guyueh1 had a problem deploying to nemo-ci January 21, 2026 17:22 — with GitHub Actions Failure

Merge branch 'main' into time_data_sharding_and_job_submission

a14631e

guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Jan 21, 2026

guyueh1 temporarily deployed to nemo-ci January 21, 2026 18:52 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci January 21, 2026 18:55 — with GitHub Actions Inactive

coderabbitai bot reviewed Jan 21, 2026

guyueh1 temporarily deployed to nemo-ci January 21, 2026 21:30 — with GitHub Actions Inactive

Merge branch 'main' into time_data_sharding_and_job_submission

c1f41d6

guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Jan 22, 2026

guyueh1 temporarily deployed to nemo-ci January 22, 2026 04:04 — with GitHub Actions Inactive

guyueh1 requested a review from youngeunkwon0405 January 22, 2026 04:04

guyueh1 temporarily deployed to nemo-ci January 22, 2026 05:55 — with GitHub Actions Inactive

youngeunkwon0405 approved these changes Jan 22, 2026
