fix: fix skip_reference_policy_logprobs_calculation and skip_prev_logprobs by jinglinglingling · Pull Request #2443 · NVIDIA-NeMo/RL

jinglinglingling · 2026-05-08T09:39:45Z

Related issues

Summary

Consolidates three open PRs onto current main and addresses review
feedback so they can ship together cleanly:

- #2177 (fix: skip prev_logprobs computation when force_on_policy_ratio is true): skip prev_logprobs computation when force_on_policy_ratio=True.
- #2174 (fix: skip_reference_policy_logprobs_calculation=true crashes training): guard the GRPO training loop against calling policy.get_reference_policy_logprobs() when the reference model was never loaded (Bug 1 of issue #1968, "skip_reference_policy_logprobs_calculation=true crashes training with RuntimeError / NameError").
- #2178 (fix: skip loading reference model when KL penalty is zero): derive init_reference_model = (kl_penalty > 0) to skip loading the reference model when KL penalty is zero.

Changes on top of those three PRs

- Move the skip_reference_policy_logprobs_calculation assert into setup() so misconfiguration fails before any GPU work (per @yfw on #2177).
- Auto-enable grpo.skip_reference_policy_logprobs_calculation=True when loss_fn.reference_policy_kl_penalty == 0, so existing recipes that have kl_penalty=0 without explicitly setting the skip flag (e.g. examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml) stop crashing inside use_reference_model() (per @yuki-97 / @terrykong on #2178).
- Add two parametrized unit tests in tests/unit/algorithms/test_grpo.py covering both grpo_train and async_grpo_train:
  - test_grpo_train_skips_reference_policy_logprobs_when_configured guards issue #1968 and PRs #2174 / #2178.
  - test_grpo_train_skips_prev_logprobs_when_force_on_policy_ratio guards PR #2177.
- Drop the worker-level guards from #2174: the grpo.py-level skip already prevents the bad call paths, so the worker-layer fallbacks are dead code.

copy-pr-bot · 2026-05-08T09:39:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


When force_on_policy_ratio=True, the importance sampling ratio is forced to 1.0, so prev_logprobs are unnecessary. Skip the expensive prepare_for_lp_inference() and get_logprobs() calls in both sync and async GRPO paths. In the loss function, use curr_logprobs.detach() as prev_logprobs instead of loading placeholder zeros from data. Also guards against incompatible use of seq_logprob_error_threshold with force_on_policy_ratio (the threshold requires real prev_logprobs). Part of NVIDIA-NeMo#1906 Co-Authored-By: Jiaqi Zeng <jiaqiz@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com>
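A short derivation may help show why prev_logprobs carry no information in this mode. It assumes the usual per-token importance ratio of PPO/GRPO-style losses; the symbols $r_t$, $\pi_{\mathrm{prev}}$, $y_t$ and the stop-gradient operator $\operatorname{sg}[\cdot]$ (the role played by `.detach()`) are illustrative notation, not names taken from the code:

```latex
r_t(\theta) = \exp\!\bigl(\log \pi_\theta(y_t) - \log \pi_{\mathrm{prev}}(y_t)\bigr),
\qquad
\log \pi_{\mathrm{prev}}(y_t) := \operatorname{sg}\bigl[\log \pi_\theta(y_t)\bigr]
\;\Longrightarrow\;
r_t(\theta) = 1,
\quad
\nabla_\theta r_t(\theta) = \nabla_\theta \log \pi_\theta(y_t).
```

Under this substitution the ratio equals 1 in value while its gradient is still that of the current log-probability, so nothing is lost by not computing prev_logprobs. It also suggests why seq_logprob_error_threshold cannot be combined with this mode: the commit message states that the threshold requires real prev_logprobs, and a detached copy of the current values is not that.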

Run ruff format 0.9.9 (matches .pre-commit-config.yaml) on the files touched by the previous commit so the rebased branch passes the format hook on current main. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com>

…t into setup The assert that `loss_fn.reference_policy_kl_penalty == 0` whenever `grpo.skip_reference_policy_logprobs_calculation=True` was previously checked deep inside `grpo_train`, after policy/cluster construction. Move it into `setup()` next to the existing `force_on_policy_ratio` validation so misconfigured runs fail fast, before any expensive initialization. Also attach an explanatory message to the assert so the failure mode is self-describing. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com>
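The fail-fast check described in this commit could look roughly like the sketch below. The function name `validate_skip_flag` is hypothetical and the dict layout is an assumption based on the config keys quoted in this PR; the actual code in `setup()` may be structured differently.

```python
def validate_skip_flag(master_config: dict) -> None:
    """Reject skip=True combined with a non-zero KL penalty before any heavy setup.

    Hypothetical helper: only the two config keys are taken from the PR text.
    """
    skip = master_config["grpo"].get(
        "skip_reference_policy_logprobs_calculation", False
    )
    kl_penalty = master_config["loss_fn"]["reference_policy_kl_penalty"]
    if skip:
        # The message makes the failure self-describing, as the commit intends.
        assert kl_penalty == 0, (
            "grpo.skip_reference_policy_logprobs_calculation=True requires "
            f"loss_fn.reference_policy_kl_penalty == 0, got {kl_penalty}"
        )


# A consistent configuration passes silently.
validate_skip_flag(
    {
        "grpo": {"skip_reference_policy_logprobs_calculation": True},
        "loss_fn": {"reference_policy_kl_penalty": 0.0},
    }
)
```

Running the check at the top of setup means a bad combination is reported before any policy or cluster is constructed.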

Fixes NVIDIA-NeMo#1968: Setting skip_reference_policy_logprobs_calculation=true with reference_policy_kl_penalty=0 crashed training in three ways:

- Bug 1: use_reference_model() context manager crash when reference model was never initialized (AttributeError on reference_state_dict). Fix: Added early-return guard in use_reference_model() for all three worker types (megatron, dtensor v1, dtensor v2) - yields without swapping when reference model is None/missing.
- Bug 2: Async GRPO path unconditionally called get_reference_policy_logprobs() without checking the skip flag. Fix: Added the same skip guard as the sync path, setting zeros_like for reference_policy_logprobs when skipping.
- Bug 3: Missing reference_policy_logprobs key in train_data causing shape mismatches downstream in loss computation. Fix: Both sync and async paths now explicitly set train_data['reference_policy_logprobs'] = zeros_like(prev_logprobs) when skipping.

Also added a _has_reference_model() helper and zeros fallback in base_policy_worker.get_reference_policy_logprobs() as defense-in-depth.

Signed-off-by: Linglin Jing <linglinj@nvidia.com>

Cherry-picked PR NVIDIA-NeMo#2174 didn't run ruff format on the worker files it touched. This commit applies the format pass so subsequent diffs stay clean. No functional changes. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com>

When reference_policy_kl_penalty is 0, the reference model is unused during GRPO training. Pass init_reference_model=False to avoid allocating memory for the reference model weights. Closes NVIDIA-NeMo#1957 Co-Authored-By: Jiaqi Zeng <jiaqiz@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com>

Addresses review on PR NVIDIA-NeMo#2178 (yuki-97, terrykong):

- yuki-97: "shall we set skip_reference_policy_logprobs_calculation to True in this situation? otherwise I guess we will get error when calling get_reference_policy_logprobs."
- terrykong: lists existing recipes that have reference_policy_kl_penalty=0 without setting the skip flag and would raise AttributeError after NVIDIA-NeMo#2178.

Adds a small auto-derive block right after PR NVIDIA-NeMo#2178's `init_reference_model = ...` line: when the reference model is not loaded, set `skip_reference_policy_logprobs_calculation=True` so the sync/async training loops do not call `get_reference_policy_logprobs()` on a non-existent reference model (issue NVIDIA-NeMo#1968 Bug 1).

The existing setup() assert (skip=True => kl_penalty must be 0) is unchanged; together with this auto-derive, the bidirectional invariant

kl_penalty == 0 <=> ref model not loaded <=> skip ref logprobs

holds for any user-provided combination of the two flags.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Linglin Jing <linglinj@nvidia.com>
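The auto-derive step can be pictured with the following minimal sketch. The helper name `derive_reference_flags` is hypothetical and the config is modeled as a plain nested dict; only the two config keys and the `init_reference_model = (kl_penalty > 0)` rule come from the PR text.

```python
def derive_reference_flags(master_config: dict) -> bool:
    """Decide whether to load the reference model; auto-enable the skip flag if not."""
    kl_penalty = master_config["loss_fn"]["reference_policy_kl_penalty"]
    # The reference model is only needed when the KL term contributes to the loss.
    init_reference_model = kl_penalty > 0
    if not init_reference_model:
        # No reference model will exist, so the training loops must not
        # ask for its logprobs.
        master_config["grpo"]["skip_reference_policy_logprobs_calculation"] = True
    return init_reference_model


# A recipe with kl_penalty=0 that never set the skip flag explicitly.
config = {
    "grpo": {"skip_reference_policy_logprobs_calculation": False},
    "loss_fn": {"reference_policy_kl_penalty": 0.0},
}
init_reference_model = derive_reference_flags(config)
```

After the call, `init_reference_model` is `False` and the skip flag in `config` has been switched on, which is the state the training loops rely on to avoid touching a reference model that was never loaded.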

Adds a functional smoke test for the path enabled by PR NVIDIA-NeMo#2178 plus the auto-skip safety net added in response to yuki-97's review:

> and I think it's better to add a functional test (or modify one
> exist functional test) for reference_policy_kl_penalty == 0.

The test runs a 2-step GRPO with reference_policy_kl_penalty=0 and without explicitly setting skip_reference_policy_logprobs_calculation, then asserts:

* the auto-skip log line fires (proves setup() override worked);
* the existing "Reference policy logprob calculation will be skipped" confirmation log fires;
* standard probs_ratio + gen_kl_error metric envelopes pass (PR NVIDIA-NeMo#2174 zeros placeholder keeps loss math valid when KL penalty is zero).

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Linglin Jing <linglinj@nvidia.com>

…ratio

Adds two parametrized unit tests in tests/unit/algorithms/test_grpo.py that cover both grpo_train and async_grpo_train:

- test_grpo_train_skips_reference_policy_logprobs_when_configured: guards issue NVIDIA-NeMo#1968 / PRs NVIDIA-NeMo#2174, NVIDIA-NeMo#2178 by asserting that policy.get_reference_policy_logprobs is never called when grpo.skip_reference_policy_logprobs_calculation=True.
- test_grpo_train_skips_prev_logprobs_when_force_on_policy_ratio: guards PR NVIDIA-NeMo#2177 by asserting that policy.get_logprobs is never called when loss_fn.force_on_policy_ratio=True.

Both tests reuse the existing mock_grpo_components fixture and the mock_async_grpo_infrastructure helper so they require no GPU / Ray cluster and run in CI in milliseconds (modulo cold-start import cost).

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Linglin Jing <linglinj@nvidia.com>
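The "never called" assertions follow a standard `unittest.mock` pattern, sketched below. `logprob_phase` is a hypothetical stand-in for the gated section of the training loop; the real tests drive `grpo_train` / `async_grpo_train` through the project's fixtures, which are not reproduced here.

```python
from unittest.mock import MagicMock


def logprob_phase(policy, batch, config):
    """Hypothetical stand-in for the logprob section of a GRPO training step."""
    out = {}
    if not config["loss_fn"]["force_on_policy_ratio"]:
        out["prev_logprobs"] = policy.get_logprobs(batch)
    if not config["grpo"]["skip_reference_policy_logprobs_calculation"]:
        out["reference_policy_logprobs"] = policy.get_reference_policy_logprobs(batch)
    return out


policy = MagicMock()
config = {
    "grpo": {"skip_reference_policy_logprobs_calculation": True},
    "loss_fn": {"force_on_policy_ratio": True},
}
result = logprob_phase(policy, batch={"input_ids": [1, 2, 3]}, config=config)

# With both flags set, neither expensive call may happen.
policy.get_logprobs.assert_not_called()
policy.get_reference_policy_logprobs.assert_not_called()
```

Because a `MagicMock` records every call, the assertions fail as soon as a future change reintroduces an unconditional call, without needing a GPU or a Ray cluster.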

Per review on the consolidation PR: the early-return guards added in nemo_rl/models/policy/workers/{base,dtensor,dtensor_v2,megatron}_policy_worker.py are redundant. The grpo.py setup() now auto-enables grpo.skip_reference_policy_logprobs_calculation when loss_fn.reference_policy_kl_penalty == 0, and the sync/async training loops both gate the policy.get_reference_policy_logprobs() call on that flag. This means the worker layer is never asked for reference logprobs when the reference model is not loaded, so the worker-level guards never fire. Also removes tests/functional/grpo_kl_zero.sh -- the four parametrized unit tests in tests/unit/algorithms/test_grpo.py (test_grpo_train_skips_reference_policy_logprobs_when_configured and test_grpo_train_skips_prev_logprobs_when_force_on_policy_ratio, each across grpo_train + async_grpo_train) cover the same skip-paths without needing GPUs or a real cluster. Signed-off-by: Linglin Jing <linglinj@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com>

jinglinglingling · 2026-05-09T02:43:31Z

/ok to test d5c1532

The auto-skip logic added in setup() (auto-enabling skip_reference_policy_logprobs_calculation when KL=0) reads master_config["loss_fn"]["reference_policy_kl_penalty"], so the mock config in test_setup_sglang_sets_model_path_and_parallel_flag must include this key. Fixes KeyError seen in L0_Unit_Tests_Other CI. Signed-off-by: Linglin Jing <linglinj@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

jinglinglingling · 2026-05-09T06:00:56Z

/ok to test 92aa6ba

The two regression tests added in this PR drive `grpo_train` / `async_grpo_train` through code paths that call `torch.zeros_like(prev_logprobs)` (PRs NVIDIA-NeMo#2174 / NVIDIA-NeMo#2178) and `torch.zeros_like(generation_logprobs)` (PR NVIDIA-NeMo#2177). Under the bare `mock_grpo_components` fixture those inputs are `MagicMock` objects, so CI failed with `TypeError: zeros_like(): argument 'input' (position 1) must be Tensor, not MagicMock` at `nemo_rl/algorithms/grpo.py:1801`. Add a `_patched_logprob_phase` context manager that swaps in real tensors for `policy.get_logprobs`, `policy.get_reference_policy_logprobs`, and `batched_message_log_to_flat_message`, and use it in both the sync and async branches of the two new tests. Signed-off-by: Linglin Jing <linglinj@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
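The idea behind such a context manager can be sketched with the standard library alone. Everything here is illustrative: `patched_logprob_phase` is a simplified, hypothetical version (the PR's `_patched_logprob_phase` also patches `batched_message_log_to_flat_message`), plain lists stand in for tensors, and the local `zeros_like` only mimics the type check that `torch.zeros_like` performs.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock, patch


def zeros_like(values):
    """Stand-in for torch.zeros_like: rejects anything that is not a real list."""
    if not isinstance(values, list):
        raise TypeError(
            f"zeros_like(): argument must be list, not {type(values).__name__}"
        )
    return [0.0] * len(values)


@contextmanager
def patched_logprob_phase(policy, values):
    """Temporarily make the mocked logprob calls return real data, not MagicMocks."""
    with patch.object(policy, "get_logprobs", return_value=values):
        with patch.object(
            policy, "get_reference_policy_logprobs", return_value=values
        ):
            yield policy


policy = MagicMock()
with patched_logprob_phase(policy, [-0.5, -1.2]):
    # Inside the block the mock hands back real data, so zeros_like succeeds.
    placeholder = zeros_like(policy.get_logprobs())
```

Outside the `with` block the mock returns a `MagicMock` again, which the stand-in `zeros_like` rejects with a `TypeError`, analogous to the CI failure quoted in the commit message.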

jinglinglingling · 2026-05-09T07:50:40Z

/ok to test c447a0d

yuki-97

LGTM, thanks @jinglinglingling. One minor comment.

@yfw could you take a look as well?

Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>

yuki-97 · 2026-05-12T03:52:57Z

/ok to test 2c57451

…probs (NVIDIA-NeMo#2443) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com> Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Nemo Assist <nemo-assist@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: zswerth <zwertheimer@nvidia.com>

…probs (#2443) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Linglin Jing <linglinj@nvidia.com> Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Nemo Assist <nemo-assist@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com>

jinglinglingling requested review from a team as code owners May 8, 2026 09:39

jinglinglingling added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 8, 2026

yfw and others added 10 commits May 8, 2026 19:22

jinglinglingling force-pushed the skip-prev-logprobs-force-onpolicy-rebased branch from a10b32e to d5c1532 Compare May 9, 2026 02:23

jinglinglingling changed the title ~~fix(grpo): skip unused logprob computations (rebases #2177, integrates #2174 + #2178)~~ fix: fix skip_reference_policy_logprobs_calculation and skip_prev_logprobs May 9, 2026

NVIDIA-NeMo deleted a comment from copy-pr-bot Bot May 9, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci May 9, 2026 02:44 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 9, 2026 06:01 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 9, 2026 07:51 Inactive

yuki-97 reviewed May 11, 2026

Comment thread nemo_rl/algorithms/grpo.py Outdated

yuki-97 requested a review from yfw May 11, 2026 05:29

Apply suggestions from code review

2c57451

Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>

yfw approved these changes May 11, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci May 12, 2026 03:53 Inactive

yuki-97 enabled auto-merge (squash) May 12, 2026 03:53

yuki-97 approved these changes May 12, 2026

yuki-97 merged commit e7266a9 into NVIDIA-NeMo:main May 12, 2026
27 checks passed

yuki-97 merged 13 commits into NVIDIA-NeMo:main from jinglinglingling:skip-prev-logprobs-force-onpolicy-rebased
