Fix RolloutManager reward normalization for uneven rollout groups#1918
Open
haoyang9804 wants to merge 1 commit into
Open
Fix RolloutManager reward normalization for uneven rollout groups#1918haoyang9804 wants to merge 1 commit into
haoyang9804 wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
RolloutManager._post_process_rewardscan silently corrupt GRPO/GSPO reward normalization when the flattened rollout batch contains uneven sample counts per prompt group. The fallback path reshapes the whole reward vector into a single row withrewards.view(-1, rewards.shape[-1]), so samples from unrelatedSample.group_indexgroups share one mean and standard deviation before the rewards are sent to training.This patch normalizes rewards by rollout group using
Sample.group_index, with positional chunks as a fallback when group ids are absent. Singleton groups stay finite and centered at zero instead of borrowing statistics from another group. A regression test covers the uneven group case.Concrete triggering example
Configuration:
Flattened rollout samples:
[ Sample(group_index=0, index=0, reward=1.0, tokens=[10, 11], response_length=1, loss_mask=[1]), Sample(group_index=0, index=1, reward=3.0, tokens=[10, 12], response_length=1, loss_mask=[1]), Sample(group_index=1, index=2, reward=5.0, tokens=[10, 13], response_length=1, loss_mask=[1]), ]Wrong value on current
main:{ "train_data_rewards": [-0.9999995231628418, 0.0, 0.9999995231628418], "policy_loss_mean": -0.06666664034128189, "clipfrac": [0.0, 0.0, 1.0] }Fixed value:
{ "train_data_rewards": [-0.7071062922477722, 0.7071062922477722, 0.0], "policy_loss_mean": 0.06108969449996948, "clipfrac": [0.0, 0.0, 0.0] }The bad values are finite, so the run does not crash, but the reward and policy-loss signal changes before the first update.
Intuitive prompt-group example
Consider one rollout minibatch where prompt A returns two samples and prompt B
returns one sample:
The old fallback loses that grouping and normalizes the flattened rewards as if
they came from one prompt:
That gives prompt B a positive normalized reward only because prompt A's rewards
were present in the same batch. Group-preserving normalization keeps prompt B's
singleton group centered at zero:
Reproduction recipe
{ "kind": "rl_sentinel_validation_recipe", "schema_version": 1, "bug_id": "SLIME-ROLLMGR-UNEVEN-REWARD-NORM", "command": "PYTHONPATH=${TARGET_REPO} BUG_ID=SLIME-ROLLMGR-UNEVEN-REWARD-NORM VALIDATION_DIR=${OUTPUT_DIR} python ${OUTPUT_DIR}/run_validation.py", "boundary": "slime.ray.rollout.RolloutManager._convert_samples_to_train_data -> _post_process_rewards", "trigger_data": { "args": { "advantage_estimator": "grpo", "rewards_normalization": true, "grpo_std_normalization": true, "n_samples_per_prompt": 2, "rollout_batch_size": 2, "reward_key": null }, "samples": [ {"group_index": 0, "index": 0, "reward": 1.0, "tokens": [10, 11], "response_length": 1, "loss_mask": [1]}, {"group_index": 0, "index": 1, "reward": 3.0, "tokens": [10, 12], "response_length": 1, "loss_mask": [1]}, {"group_index": 1, "index": 2, "reward": 5.0, "tokens": [10, 13], "response_length": 1, "loss_mask": [1]} ] } }Validation runner
Observed output
On unfixed
main:{ "status": "reproduced", "observed": { "train_data_rewards": [-0.9999995231628418, 0.0, 0.9999995231628418], "policy_loss_mean": -0.06666664034128189, "clipfrac": [0.0, 0.0, 1.0] }, "expected": { "train_data_rewards": [-0.7071062922477722, 0.7071062922477722, 0.0], "policy_loss_mean": 0.06108969449996948, "clipfrac": [0.0, 0.0, 0.0] } }On this branch:
{ "status": "fixed", "delta_after_fix": { "max_reward_abs_delta": 0.0, "max_policy_loss_abs_delta": 0.0 } }Root cause
The fallback for uneven reward counts used:
That creates one normalization row containing every remaining sample. The function already receives
Sampleobjects withgroup_index, so the fallback can preserve rollout group identity instead of mixing unrelated groups.Fix
The patch builds grouped sample indices from
Sample.group_index, falling back to positional chunks ofn_samples_per_promptonly when a sample has no group id. Each group is mean-centered independently. For GRPO/GSPO std normalization, multi-sample groups keep the existing unbiasedtorch.stdbehavior, while singleton groups are returned as zero after centering so they stay finite.Tests and checks
Results:
Contribution and duplicate checks
Upstream repo:
THUDM/slime.Contribution file read:
CONTRIBUTING.md. It welcomes bug fixes with clear reproduction and tests, which matches this change.Duplicate checks performed:
BUG_FINDINGS.mdsearched forRolloutManager,_post_process_rewards, uneven reward groups, singleton reward normalization, and reward-group contamination.${RL_SENTINEL_ARTIFACTS_ROOT}/testing-loop/slime-watch-20260518-002939-2.pr_drafts/searched.fix/gspo-masked-old-logprob-nan, which fixes a different GSPO masked old-logprob NaN invariant.ghwas unavailable in this environment, so no fresh upstream PR result is claimed.mainstill contained therewards.view(-1, rewards.shape[-1])fallback before this patch.Related PRs or fixes
Related same-family findings in the local ledger cover reward contamination in other projects and masked invalid-value bugs. They are not exact duplicates because this issue is specific to slime
RolloutManager._post_process_rewardsmixing uneven rollout reward groups before training conversion. The prior branchfix/gspo-masked-old-logprob-nanis also not a duplicate; it addresses masked old logprobs in GSPO KL calculation rather than group reward normalization.