feat: skip logprob and reference logprob computation under certain conditions by guyueh1 · Pull Request #1891 · NVIDIA-NeMo/RL

guyueh1 · 2026-02-06T04:32:35Z

What does this PR do ?

Skip logprob and reference logprob computation under certain conditions:

- When `loss_fn.skip_reference_policy_logprobs_calculation=true`, skip the reference logprob computation. This requires `loss_fn.reference_kl_penalty == 0`, which is checked whenever `skip_reference_policy_logprobs_calculation` is true.
- When `loss_fn.force_on_policy_ratio=true`, skip the logprob computation. This requires rollout batch size == train global batch size, which is checked whenever `force_on_policy_ratio` is true.
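
The two conditions above can be sketched as a small validation step. This is a minimal illustration, not the code in `grpo.py`: the helper name `resolve_skip_flags` and its arguments are hypothetical, and the config keys follow the names used in this description.

```python
def resolve_skip_flags(
    loss_cfg: dict,
    rollout_batch_size: int,
    train_global_batch_size: int,
) -> tuple[bool, bool]:
    """Return (skip_reference_logprobs, skip_prev_logprobs).

    Hypothetical helper: raises ValueError when a flag is set but its
    stated requirement does not hold.
    """
    skip_reference = bool(loss_cfg.get("skip_reference_policy_logprobs_calculation"))
    skip_prev = bool(loss_cfg.get("force_on_policy_ratio"))

    # Skipping the reference logprobs is only valid without a KL penalty.
    if skip_reference and loss_cfg["reference_kl_penalty"] != 0:
        raise ValueError(
            "skip_reference_policy_logprobs_calculation requires reference_kl_penalty == 0"
        )

    # Forcing the on-policy ratio is only valid with one optimizer step per rollout.
    if skip_prev and rollout_batch_size != train_global_batch_size:
        raise ValueError(
            "force_on_policy_ratio requires rollout batch size == train global batch size"
        )

    return skip_reference, skip_prev
```

Calling the helper with both flags set, a zero KL penalty, and equal batch sizes returns `(True, True)`; any violated requirement fails fast before training starts.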

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Summary by CodeRabbit

Release Notes

New Features
- Added configuration options to optimize GRPO training: ability to skip reference policy logprob calculations and enforce on-policy ratio in loss function computations.
- New flags reduce computational overhead during training while maintaining training stability.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

coderabbitai · 2026-02-11T22:20:06Z

Walkthrough

This PR introduces configurable flags to optimize GRPO training by enabling optional skipping of logprob computations. Configuration files are updated with force_on_policy_ratio and skip_reference_policy_logprobs_calculation flags, while the algorithm implementation adds conditional gating for logprob computation in GRPO training and loss function evaluation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration Updates: `examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml`, `examples/configs/recipes/llm/performance/dapo-deepseek-v3-64n8g.yaml` | Added `skip_reference_policy_logprobs_calculation: true` flag under grpo section to enable skipping reference policy logprob calculations. |
| GRPO Recipe Configurations: `examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.yaml`, `examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml`, `examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.yaml` | Added `loss_fn.force_on_policy_ratio: true` configuration block under grpo section to enforce on-policy ratio in loss computation. |
| GRPO Algorithm Implementation: `nemo_rl/algorithms/grpo.py` | Added conditional gating for logprob computations based on `force_on_policy_ratio` and `skip_reference_policy_logprobs_calculation` flags; sets `train_data.prev_logprobs` to zeros when skipping prev_logprobs computation. |
| Loss Function Logic: `nemo_rl/algorithms/loss_functions.py` | Modified `ClippedPGLossFn.__call__` to compute on-policy `curr_logprobs` internally when `force_on_policy_ratio` is enabled; handles distributed computing scenarios (vocab_parallel_group, DTensor) with appropriate vocab range and padding adjustments; sets `prev_logprobs` to computed `curr_logprobs` for on-policy behavior. |
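
Why setting `prev_logprobs` to `curr_logprobs` is safe can be seen from the importance ratio itself. The sketch below uses a generic per-token clipped policy-gradient loss in plain Python; it is an assumption-laden illustration, not the body of `ClippedPGLossFn.__call__`, and the function names and the `clip_eps` default are hypothetical. Plain floats cannot show the gradient path: in a tensor implementation the substituted `prev_logprobs` would be detached so that gradients still flow through `curr_logprobs`.

```python
import math


def clipped_pg_token_loss(
    curr_logprob: float,
    prev_logprob: float,
    advantage: float,
    clip_eps: float = 0.2,
) -> float:
    """Generic per-token clipped policy-gradient loss (PPO-style)."""
    ratio = math.exp(curr_logprob - prev_logprob)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic (minimum) surrogate objective, negated to form a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)


def on_policy_token_loss(
    curr_logprob: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """Force the ratio to 1 by using curr_logprob as prev_logprob."""
    return clipped_pg_token_loss(curr_logprob, curr_logprob, advantage, clip_eps)
```

With the ratio pinned to `exp(0) = 1`, clipping never activates and the loss value reduces to `-advantage`, which is why a separate forward pass for `prev_logprobs` adds no information in this setting.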

Sequence Diagram

sequenceDiagram
    participant TrainingLoop as GRPO Training Loop
    participant LogprobCalc as Logprob Calculation
    participant LossFunc as Loss Function
    participant DataPrep as Data Preparation

    TrainingLoop->>DataPrep: Prepare training data
    
    alt skip_reference_policy_logprobs_calculation == false
        DataPrep->>LogprobCalc: Compute reference_policy_logprobs
        LogprobCalc-->>DataPrep: reference_policy_logprobs
    else
        DataPrep-->>DataPrep: Skip reference logprob computation
    end
    
    alt skip_prev_logprobs == false
        DataPrep->>LogprobCalc: Compute prev_logprobs
        LogprobCalc-->>DataPrep: prev_logprobs
    else
        DataPrep-->>DataPrep: Set prev_logprobs to zeros
    end
    
    DataPrep->>LossFunc: Pass data with logprobs
    
    alt force_on_policy_ratio == true
        LossFunc->>LossFunc: Compute curr_logprobs on-policy
        LossFunc->>LossFunc: Override prev_logprobs with curr_logprobs
    else
        LossFunc->>LossFunc: Use provided prev_logprobs
    end
    
    LossFunc-->>TrainingLoop: Compute loss
    TrainingLoop->>TrainingLoop: Backpropagate

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

- feat: force on-policy ratio to 1 #1529: Directly implements the force_on_policy_ratio feature with corresponding grpo and loss_functions changes for on-policy behavior enforcement.
- feat: Implement ProRLv2 recipe #1809: Related algorithm changes to grpo.py and loss_functions.py for logprob-based advantage computation and importance-sampling behavior.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ❓ Inconclusive | The PR contains significant changes to loss function computation and logprob handling, but no test results or testing documentation are visible in the provided context. | Search for test results documentation in PR comments, CI/CD logs, or attached test reports to confirm whether testing was performed. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'skip logprob and reference logprob computation under certain conditions' accurately describes the main changes in the PR, which add skip flags and conditional gating for logprob computations across config files and algorithm implementations. |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@nemo_rl/algorithms/grpo.py`:
- Around line 1579-1597: The code currently zero-fills
train_data["prev_logprobs"] when force_on_policy_ratio is True which leads to
misleading logs and plots (token_mult_prob_error); change the handling so that
when master_config["loss_fn"].get("force_on_policy_ratio", False) is True you
either (A) avoid emitting prev_logprobs into log_data and skip plotting
token_mult_prob_error, or (B) back-fill train_data["prev_logprobs"] with the
actual on-policy probabilities returned by the training step (e.g., use
train_results["curr_logprobs"] / .detach() if present) before any
logging/visualization; update the code paths around prev_logprobs,
train_results, log_data["prev_logprobs"], and token_mult_prob_error to implement
one of these behaviors.
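
Option (A) from the comment above could look roughly like the following. This is only a sketch under stated assumptions: `build_log_data` and `should_plot_token_mult_prob_error` are hypothetical helpers, and the keys `token_ids` and `generation_logprobs` are placeholders rather than the actual contents of `log_data` in `grpo.py`.

```python
def build_log_data(train_data: dict, force_on_policy_ratio: bool) -> dict:
    """Collect fields for logging, omitting placeholder prev_logprobs."""
    log_data = {
        "token_ids": train_data["token_ids"],
        "generation_logprobs": train_data["generation_logprobs"],
    }
    # When the on-policy ratio is forced, prev_logprobs holds zeros that were
    # never computed by the model, so it is left out of the logs entirely.
    if not force_on_policy_ratio:
        log_data["prev_logprobs"] = train_data["prev_logprobs"]
    return log_data


def should_plot_token_mult_prob_error(log_data: dict) -> bool:
    """Only plot the error metric when real prev_logprobs were logged."""
    return "prev_logprobs" in log_data
```

Keying the plot decision on the presence of `prev_logprobs` keeps the logging and plotting paths from drifting apart.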

🧹 Nitpick comments (3)

nemo_rl/algorithms/grpo.py (3)

337-337: Explain why NRL_IGNORE_TP_ACCURACY_CHECK is needed when force_on_policy_ratio is enabled.

Setting a global environment variable as a side effect of a config flag is opaque. Consider adding a comment explaining why the TP accuracy check must be disabled here, so future maintainers understand the coupling.

1602-1621: Minor: logprob_data is allocated even when both logprob computations are skipped.

When both skip_prev_logprobs and skip_reference_policy_logprobs are True, the logprob_data dict on lines 1602–1608 is created but never read. This is lightweight (just references, no tensor copies), so it's not a real concern — just a nit for clarity.

2601-2633: Duplicated skip-logic between grpo_train and async_grpo_train.

Lines 2601–2633 are nearly identical to lines 1579–1621 in grpo_train. Consider extracting the flag resolution and conditional prepare_for_lp_inference / logprob gating into a shared helper to keep both paths in sync and reduce maintenance burden.
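
One possible shape for such a shared helper is sketched below. The name `compute_logprobs_if_needed` and the callable-based interface are assumptions made for illustration; the real code would call the policy's own logprob methods. The helper returns early when both computations are skipped, so nothing is allocated and no progress message is printed in that case.

```python
from typing import Callable


def compute_logprobs_if_needed(
    skip_prev_logprobs: bool,
    skip_reference_policy_logprobs: bool,
    compute_prev: Callable[[], list],
    compute_reference: Callable[[], list],
    log: Callable[[str], None] = print,
) -> dict:
    """Run only the logprob computations that were not skipped."""
    results: dict = {}
    if skip_prev_logprobs and skip_reference_policy_logprobs:
        # Nothing to do: avoid preparing inputs or logging a misleading message.
        return results

    log("▶ Computing logprobs...")
    if not skip_prev_logprobs:
        results["prev_logprobs"] = compute_prev()
    if not skip_reference_policy_logprobs:
        results["reference_policy_logprobs"] = compute_reference()
    return results
```

Both `grpo_train` and `async_grpo_train` could then call the same function, passing their own compute callables.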

youngeunkwon0405 · 2026-02-12T08:37:30Z

I have two questions related to logprob skipping.

1. For the reference logprob, shouldn't we always skip the reference logprob when loss_fn.reference_kl_penalty == 0? Why would we need an additional argument loss_fn.skip_reference_policy_logprobs_calculation?
2. For the prev_logprob, I think in theory (please correct me if I am wrong), if the generation model and policy model are numerically identical, then generation_logprob == prev_logprob and we could always skip the prev_logprob calculation unless we need to report metrics such as mult_prob_error and gen_kl_error. I think having an argument like loss_fn.use_generation_logprob could be more intuitive (if it is true and seq_logprob_error_threshold==false, skip the prev_logprob calculation).
3. For the force_on_policy_ratio, I think it was just asserting that the training batch size is equal to the generation batch size, so it can use the current_logprob to skip the prev_logprob calculation. I think we can just skip the prev_logprob calculation whenever train_bs==gen_bs, seq_logprob_error_threshold==false, and no other feature requires logprob error metrics.

guyueh1 · 2026-02-13T21:31:38Z

I think the logic for skip_reference_policy_logprobs_calculation already exists in the codebase (RL/nemo_rl/algorithms/grpo.py, line 148 in 869b5e5: `skip_reference_policy_logprobs_calculation: NotRequired[bool]`). I just refactored it a bit and added the logic to async_grpo_train in this PR. But I think your point is valid. The current logic is that the user needs to explicitly specify skip_reference_policy_logprobs_calculation to skip the computation, and it only works when the KL penalty is 0; the correct logic is that the computation is skipped whenever the KL penalty is 0. @terrykong which one is better?

I am still thinking about 2, but I do agree with 3: when certain conditions are met, we should skip the logprob computation even if the user doesn't explicitly specify force_on_policy_ratio. Currently we are coupling a perf optimization with an algorithm feature that the user wants to enforce, and that's bad. I will revise based on 3.
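
The decoupling discussed here, where the skip is derived from conditions rather than from an explicit flag, might be sketched as two predicates. The function names are hypothetical, and treating an unset `seq_logprob_error_threshold` as `None` is an assumption about how the config represents "disabled".

```python
from typing import Optional


def can_skip_reference_logprobs(reference_kl_penalty: float) -> bool:
    """Reference logprobs only feed the KL term, so a zero penalty makes them unused."""
    return reference_kl_penalty == 0


def can_skip_prev_logprobs(
    train_global_batch_size: int,
    rollout_batch_size: int,
    seq_logprob_error_threshold: Optional[float],
) -> bool:
    """prev_logprobs can be dropped when each rollout gets exactly one optimizer
    step and no feature needs the generation-vs-policy logprob error."""
    one_step_per_rollout = train_global_batch_size == rollout_batch_size
    error_metrics_unused = seq_logprob_error_threshold is None
    return one_step_per_rollout and error_metrics_unused
```

With predicates like these, `force_on_policy_ratio` stays a pure algorithm choice and the performance optimization applies automatically whenever it is safe.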

claude · 2026-05-06T00:15:02Z

Review Summary

The logic for skipping logprobs/reference logprobs looks correct — the conditions are consistent between grpo.py and loss_functions.py, and the guards for seq_logprob_error_threshold incompatibility are in place.

Test coverage: The new skip_prev_logprobs and skip_reference_policy_logprobs code paths in both grpo_train and async_grpo_train don't have dedicated tests. The only test change is adding reference_policy_kl_penalty: 0.01 to the async config. Consider adding test cases that exercise:

- force_on_policy_ratio=True (verifying prev_logprobs computation is actually skipped and training succeeds)
- reference_policy_kl_penalty=0 (verifying reference logprobs are skipped)
- The assertion that seq_logprob_error_threshold is incompatible with force_on_policy_ratio=True

See inline comments for minor issues.

guyueh1 · 2026-05-06T21:05:53Z

/ok to test 6abdc51

guyueh1 · 2026-05-07T17:09:20Z

/ok to test 71f0ff6

guyueh1 · 2026-05-08T22:57:20Z

/claude review

claude · 2026-05-08T23:00:16Z

                print("▶ Computing logprobs...", flush=True)
                with timer.time("policy_and_reference_logprobs"):


nit: When both skip_prev_logprobs and skip_reference_policy_logprobs are true, this still prints "Computing logprobs..." and constructs logprob_data only to immediately delete it. Consider wrapping the entire block (including the print) in the skip check, or adjusting the log message.

guyueh1 · 2026-05-10

I think it's OK to ignore.

claude

Review summary

The core logic looks correct — skipping prev_logprobs when force_on_policy_ratio=True and skipping reference logprobs when reference_policy_kl_penalty==0 is sound, and the loss function properly uses curr_logprobs.detach() as the substitute. The guard against seq_logprob_error_threshold when prev_logprobs are skipped is a good catch. Tests cover the loss function layer well.

Two minor items flagged inline:

- .get("force_on_policy_ratio", False) in grpo.py (both sync and async paths) uses a hidden boolean default, which violates config conventions. Use .get(key) without a default, or direct access since the exemplar YAML always provides it.
- When both logprob computations are skipped, the code still prints "Computing logprobs..." and constructs logprob_data unnecessarily.
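
The difference between the three access patterns in the first item is easy to see on a plain dictionary. The config contents below are made up for illustration; only the standard `dict.get` and `dict[...]` behavior is relied on.

```python
def read_with_hidden_default(loss_cfg: dict) -> bool:
    # A missing key silently becomes False: a typo in the YAML goes unnoticed.
    return loss_cfg.get("force_on_policy_ratio", False)


def read_optional(loss_cfg: dict):
    # A missing key yields None, which callers can tell apart from False.
    return loss_cfg.get("force_on_policy_ratio")


def read_required(loss_cfg: dict) -> bool:
    # A missing key raises KeyError, surfacing an incomplete config immediately.
    return loss_cfg["force_on_policy_ratio"]


# Hypothetical config with the key accidentally left out.
incomplete_cfg = {"reference_policy_kl_penalty": 0.0}
```

On `incomplete_cfg`, the first reader returns `False`, the second returns `None`, and the third raises `KeyError`, which is the behavior the convention prefers when the exemplar YAML always provides the key.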

One other note: the removal of skip_reference_policy_logprobs_calculation makes the example in skills/config-conventions/SKILL.md (line 57) stale — worth updating to avoid confusion.

guyueh1 · 2026-05-12T15:18:19Z

dup of https://github.com/NVIDIA-NeMo/RL/pull/2443/changes

guyueh1 added 2 commits November 12, 2025 08:24

save

dd9498d

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Merge branch 'main' into fuse_logprob_train

5a3c52e

guyueh1 self-assigned this Feb 6, 2026

guyueh1 added the deepseek Related to deepseek 671b label Feb 6, 2026

guyueh1 added 6 commits February 9, 2026 20:26

Fix merge conflict

48ecb0b

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Cleaner

a0b58e8

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

fix

4f0f1f4

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

save

8d17ba3

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

save

4a3d561

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Fix; add the config to recipes

a8156b8

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 marked this pull request as ready for review February 11, 2026 22:17

guyueh1 requested review from a team as code owners February 11, 2026 22:17

guyueh1 added the CI:L1 Run doctests, unit tests, and functional tests label Feb 11, 2026

guyueh1 requested a review from HeyyyyyyG February 11, 2026 22:18

guyueh1 temporarily deployed to nemo-ci February 11, 2026 22:18 — with GitHub Actions Inactive

guyueh1 requested review from gshennvm and youngeunkwon0405 February 11, 2026 22:18

Merge branch 'main' into fuse_logprob_train

1e25d65

guyueh1 changed the title ~~feat: Fuse logprob and train when rollout and train have same batch size~~ feat: skip logprob and reference logprob computation under certain conditions Feb 11, 2026

coderabbitai Bot reviewed Feb 11, 2026

View reviewed changes

Comment thread nemo_rl/algorithms/grpo.py Outdated

guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 12, 2026

guyueh1 temporarily deployed to nemo-ci February 12, 2026 18:16 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci February 12, 2026 19:49 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci February 13, 2026 00:37 — with GitHub Actions Inactive

guyueh1 requested a review from terrykong February 13, 2026 21:28

claude Bot reviewed May 6, 2026

View reviewed changes

Comment thread nemo_rl/algorithms/grpo.py

copy-pr-bot Bot temporarily deployed to nemo-ci May 6, 2026 00:23 Inactive

Address claude comment

6abdc51

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>

copy-pr-bot Bot had a problem deploying to nemo-ci May 6, 2026 21:06 Error

copy-pr-bot Bot temporarily deployed to nemo-ci May 6, 2026 21:11 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 6, 2026 21:23 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 6, 2026 22:57 Inactive

guyueh1 added 2 commits May 7, 2026 10:08

Revert "Address claude comment"

edba691

This reverts commit 6abdc51. Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Add unit test

71f0ff6

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 force-pushed the fuse_logprob_train branch from 0c818bf to 71f0ff6 Compare May 7, 2026 17:08

copy-pr-bot Bot temporarily deployed to nemo-ci May 7, 2026 17:09 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 7, 2026 18:57 Inactive

guyueh1 linked an issue May 7, 2026 that may be closed by this pull request

Remove unnecessary model forward passes in logprob phase in GRPO #1186

Closed

copy-pr-bot Bot temporarily deployed to nemo-ci May 7, 2026 20:33 Inactive

guyueh1 requested review from terrykong, yfw and youngeunkwon0405 and removed request for terrykong May 8, 2026 15:02

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread nemo_rl/algorithms/grpo.py Outdated

claude Bot reviewed May 8, 2026

View reviewed changes

guyueh1 added 3 commits May 12, 2026 08:11

Merge branch 'main' into fuse_logprob_train

39b6202

Use skip prev_logprob in perf recipes

84a737b

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Merge branch 'fuse_logprob_train' of ssh://github.com/guyueh1/NeMo-RL…

b3c52da

… into fuse_logprob_train

guyueh1 closed this May 12, 2026

4 participants
