Fix RolloutManager reward normalization for uneven rollout groups #1918

Open
haoyang9804 wants to merge 1 commit into THUDM:main from haoyang9804:fix/rollmgr-uneven-reward-norm

Summary

RolloutManager._post_process_rewards can silently corrupt GRPO/GSPO reward normalization when the flattened rollout batch contains uneven sample counts per prompt group. The fallback path reshapes the whole reward vector into a single row with rewards.view(-1, rewards.shape[-1]), so samples from unrelated Sample.group_index groups share one mean and standard deviation before the rewards are sent to training.

This patch normalizes rewards by rollout group using Sample.group_index, with positional chunks as a fallback when group ids are absent. Singleton groups stay finite and centered at zero instead of borrowing statistics from another group. A regression test covers the uneven group case.

Concrete triggering example

Configuration:

args = Namespace(
    advantage_estimator="grpo",
    rewards_normalization=True,
    grpo_std_normalization=True,
    n_samples_per_prompt=2,
    rollout_batch_size=2,
    reward_key=None,
)

Flattened rollout samples:

[
    Sample(group_index=0, index=0, reward=1.0, tokens=[10, 11], response_length=1, loss_mask=[1]),
    Sample(group_index=0, index=1, reward=3.0, tokens=[10, 12], response_length=1, loss_mask=[1]),
    Sample(group_index=1, index=2, reward=5.0, tokens=[10, 13], response_length=1, loss_mask=[1]),
]

Wrong value on current main:

{
  "train_data_rewards": [-0.9999995231628418, 0.0, 0.9999995231628418],
  "policy_loss_mean": -0.06666664034128189,
  "clipfrac": [0.0, 0.0, 1.0]
}

Fixed value:

{
  "train_data_rewards": [-0.7071062922477722, 0.7071062922477722, 0.0],
  "policy_loss_mean": 0.06108969449996948,
  "clipfrac": [0.0, 0.0, 0.0]
}

The bad values are finite, so the run does not crash, but the rewards and the policy-loss signal are already wrong before the first update.
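The policy-loss and clipfrac numbers above can be checked by hand. The following is a minimal pure-Python sketch, assuming the standard PPO clipped surrogate with ratio = exp(-ppo_kl); `clipped_policy_loss` is a hypothetical stand-in written for this illustration, not slime's `compute_policy_loss`. The `ppo_kl` values and clip bounds are the ones used in the validation runner.

```python
import math

def clipped_policy_loss(ppo_kl, advantages, eps_clip=0.2, eps_clip_high=0.2):
    """Per-sample clipped surrogate loss plus a 0/1 clip indicator (illustrative)."""
    losses, clipped = [], []
    for kl, adv in zip(ppo_kl, advantages):
        ratio = math.exp(-kl)  # assumption: ppo_kl = old_logprob - new_logprob
        bounded = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
        unclipped_loss = -ratio * adv
        clipped_loss = -bounded * adv
        losses.append(max(unclipped_loss, clipped_loss))
        clipped.append(1.0 if clipped_loss > unclipped_loss else 0.0)
    return losses, clipped

ppo_kl = [0.0, 0.3, -0.2]  # same values as the validation runner

# Rewards as normalized on current main (groups mixed) and with the fix.
mixed = [-0.9999995231628418, 0.0, 0.9999995231628418]
fixed = [-0.7071062922477722, 0.7071062922477722, 0.0]

mixed_losses, mixed_clip = clipped_policy_loss(ppo_kl, mixed)
fixed_losses, fixed_clip = clipped_policy_loss(ppo_kl, fixed)

mixed_mean = sum(mixed_losses) / len(mixed_losses)
fixed_mean = sum(fixed_losses) / len(fixed_losses)
print(round(mixed_mean, 4), mixed_clip)  # -0.0667 [0.0, 0.0, 1.0]
print(round(fixed_mean, 4), fixed_clip)  # 0.0611 [0.0, 0.0, 0.0]
```

The singleton sample B0 is the one that gets clipped on main: it only has a nonzero advantage because prompt A's rewards leaked into its statistics.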

Intuitive prompt-group example

Consider one rollout minibatch where prompt A returns two samples and prompt B
returns one sample:

sample      prompt group      raw reward      intended normalization group
A0          0                 1.0             [1.0, 3.0]
A1          0                 3.0             [1.0, 3.0]
B0          1                 5.0             [5.0]

The old fallback loses that grouping and normalizes the flattened rewards as if
they came from one prompt:

flattened rewards:        [1.0, 3.0, 5.0]
old normalized rewards:   [-1.0, 0.0, 1.0]

That gives prompt B a positive normalized reward only because prompt A's rewards
were present in the same batch. Group-preserving normalization keeps prompt B's
singleton group centered at zero:

group 0 rewards:          [1.0, 3.0] -> [-0.7071, 0.7071]
group 1 rewards:          [5.0]      -> [0.0]
fixed normalized rewards: [-0.7071, 0.7071, 0.0]
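The two normalizations can be reproduced with the standard library alone. This is a small sketch of the arithmetic, not the patch itself: `normalize` is a hypothetical helper, and the 1e-6 epsilon and unbiased standard deviation follow the reference normalization in the validation runner.

```python
from statistics import mean, stdev

EPS = 1e-6  # same epsilon the validation runner adds to the std

def normalize(values):
    """Mean-center, then divide by the unbiased std; a singleton maps to 0.0."""
    mu = mean(values)
    centered = [v - mu for v in values]
    if len(values) <= 1:
        return [0.0 for _ in centered]
    sigma = stdev(values)  # sample (n - 1) standard deviation
    return [c / (sigma + EPS) for c in centered]

rewards = [1.0, 3.0, 5.0]
group_ids = [0, 0, 1]

# Old fallback: one shared mean and std for the whole flattened batch.
flat = normalize(rewards)

# Group-preserving: normalize each group separately, keeping sample order.
grouped = [0.0] * len(rewards)
for gid in dict.fromkeys(group_ids):  # unique ids in first-seen order
    idx = [i for i, g in enumerate(group_ids) if g == gid]
    for i, value in zip(idx, normalize([rewards[i] for i in idx])):
        grouped[i] = value

print([round(x, 4) for x in flat])     # [-1.0, 0.0, 1.0]
print([round(x, 4) for x in grouped])  # [-0.7071, 0.7071, 0.0]
```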

Reproduction recipe

{
  "kind": "rl_sentinel_validation_recipe",
  "schema_version": 1,
  "bug_id": "SLIME-ROLLMGR-UNEVEN-REWARD-NORM",
  "command": "PYTHONPATH=${TARGET_REPO} BUG_ID=SLIME-ROLLMGR-UNEVEN-REWARD-NORM VALIDATION_DIR=${OUTPUT_DIR} python ${OUTPUT_DIR}/run_validation.py",
  "boundary": "slime.ray.rollout.RolloutManager._convert_samples_to_train_data -> _post_process_rewards",
  "trigger_data": {
    "args": {
      "advantage_estimator": "grpo",
      "rewards_normalization": true,
      "grpo_std_normalization": true,
      "n_samples_per_prompt": 2,
      "rollout_batch_size": 2,
      "reward_key": null
    },
    "samples": [
      {"group_index": 0, "index": 0, "reward": 1.0, "tokens": [10, 11], "response_length": 1, "loss_mask": [1]},
      {"group_index": 0, "index": 1, "reward": 3.0, "tokens": [10, 12], "response_length": 1, "loss_mask": [1]},
      {"group_index": 1, "index": 2, "reward": 5.0, "tokens": [10, 13], "response_length": 1, "loss_mask": [1]}
    ]
  }
}

Validation runner

#!/usr/bin/env python3
import contextlib
import json
import os
import sys
from argparse import Namespace
from collections import OrderedDict
from pathlib import Path

import torch

BUG_ID = os.environ.get("BUG_ID", "SLIME-ROLLMGR-UNEVEN-REWARD-NORM")
OUTPUT_DIR = Path(os.environ["VALIDATION_DIR"])
TARGET_REPO = os.environ["TARGET_REPO"]

sys.path.insert(0, TARGET_REPO)

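# Send anything printed during import to stderr so stdout carries only the JSON payload.
with contextlib.redirect_stdout(sys.stderr):
    from slime.ray.rollout import RolloutManager
    from slime.utils.ppo_utils import compute_policy_loss
    from slime.utils.types import Sample

# Unwrap the plain Python class behind the Ray actor wrapper and allocate it
# with object.__new__, which skips __init__ (and any Ray setup it performs).
RolloutManagerClass = RolloutManager.__ray_metadata__.modified_class
manager = object.__new__(RolloutManagerClass)
manager.custom_reward_post_process_func = None
manager.custom_convert_samples_to_train_data_func = None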
manager.args = Namespace(
    advantage_estimator="grpo",
    rewards_normalization=True,
    grpo_std_normalization=True,
    n_samples_per_prompt=2,
    rollout_batch_size=2,
    reward_key=None,
)

samples = [
    Sample(group_index=0, index=0, reward=1.0, tokens=[10, 11], response_length=1, loss_mask=[1]),
    Sample(group_index=0, index=1, reward=3.0, tokens=[10, 12], response_length=1, loss_mask=[1]),
    Sample(group_index=1, index=2, reward=5.0, tokens=[10, 13], response_length=1, loss_mask=[1]),
]

train_data = manager._convert_samples_to_train_data(samples)
observed_rewards = [float(x) for x in train_data["rewards"]]

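def normalize_group(vals):
    """Reference normalization for one group: mean-center, then divide by the
    unbiased std (+1e-6). A singleton group maps to zero."""
    tensor = torch.tensor(vals, dtype=torch.float32)
    centered = tensor - tensor.mean()
    if tensor.numel() <= 1:
        return torch.zeros_like(centered)
    return centered / (centered.std(unbiased=True) + 1e-6)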

expected_by_index = {}
groups = OrderedDict()
for sample in samples:
    groups.setdefault(sample.group_index, []).append(sample)
for group_samples in groups.values():
    normalized = normalize_group([float(sample.reward) for sample in group_samples])
    for sample, value in zip(group_samples, normalized.tolist(), strict=True):
        expected_by_index[sample.index] = float(value)
expected_rewards = [expected_by_index[sample.index] for sample in samples]

observed_tensor = torch.tensor(observed_rewards, dtype=torch.float32)
expected_tensor = torch.tensor(expected_rewards, dtype=torch.float32)
ppo_kl = torch.tensor([0.0, 0.3, -0.2], dtype=torch.float32)
observed_pg, observed_clip = compute_policy_loss(ppo_kl, observed_tensor, eps_clip=0.2, eps_clip_high=0.2)
expected_pg, expected_clip = compute_policy_loss(ppo_kl, expected_tensor, eps_clip=0.2, eps_clip_high=0.2)

payload = {
    "kind": "slime_reward_norm_validation",
    "bug_id": BUG_ID,
    "status": "reproduced" if observed_rewards != expected_rewards else "fixed",
    "observed": {
        "train_data_rewards": observed_rewards,
        "policy_loss_mean": float(observed_pg.mean().item()),
        "clipfrac": [float(x) for x in observed_clip.tolist()],
    },
    "expected": {
        "train_data_rewards": expected_rewards,
        "policy_loss_mean": float(expected_pg.mean().item()),
        "clipfrac": [float(x) for x in expected_clip.tolist()],
    },
}
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
(OUTPUT_DIR / "training_signal_validation.json").write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
print(json.dumps(payload, indent=2, sort_keys=True))

Observed output

On unfixed main:

{
  "status": "reproduced",
  "observed": {
    "train_data_rewards": [-0.9999995231628418, 0.0, 0.9999995231628418],
    "policy_loss_mean": -0.06666664034128189,
    "clipfrac": [0.0, 0.0, 1.0]
  },
  "expected": {
    "train_data_rewards": [-0.7071062922477722, 0.7071062922477722, 0.0],
    "policy_loss_mean": 0.06108969449996948,
    "clipfrac": [0.0, 0.0, 0.0]
  }
}

On this branch:

{
  "status": "fixed",
  "delta_after_fix": {
    "max_reward_abs_delta": 0.0,
    "max_policy_loss_abs_delta": 0.0
  }
}

Root cause

The fallback for uneven reward counts used:

rewards = rewards.view(-1, rewards.shape[-1])

For a 1-D reward tensor, rewards.shape[-1] is the total number of samples, so this creates a single normalization row containing every sample in the batch. The function already receives Sample objects with group_index, so the fallback can preserve rollout group identity instead of mixing unrelated groups.
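The shape effect is easy to see without torch. The sketch below uses a hypothetical `view_rows` helper as a pure-Python stand-in for calling `.view(-1, width)` on a 1-D tensor; it is only meant to illustrate the row layout, not to mirror slime's code.

```python
def view_rows(flat, width):
    """Pure-Python stand-in for a 1-D tensor's .view(-1, width)."""
    if width <= 0 or len(flat) % width != 0:
        raise ValueError(f"cannot view {len(flat)} values as rows of {width}")
    return [flat[i:i + width] for i in range(0, len(flat), width)]

rewards = [1.0, 3.0, 5.0]  # intended groups: [1.0, 3.0] and [5.0]

# For a 1-D tensor, shape[-1] is its total length, so the fallback
# width equals len(rewards) and everything lands in one row.
rows = view_rows(rewards, len(rewards))
print(rows)  # [[1.0, 3.0, 5.0]]

# Rows of n_samples_per_prompt=2 are not possible for three rewards,
# which is the situation that sends the code down the fallback path.
try:
    view_rows(rewards, 2)
except ValueError as err:
    print(err)
```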

Fix

The patch builds grouped sample indices from Sample.group_index, falling back to positional chunks of n_samples_per_prompt only when a sample has no group id. Each group is mean-centered independently. For GRPO/GSPO std normalization, multi-sample groups keep the existing unbiased torch.std behavior, while singleton groups are returned as zero after centering so they stay finite.
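As a rough illustration of the grouping step only (not the code in this patch), the sketch below builds per-group position lists. `FakeSample` and `group_sample_indices` are hypothetical names, and the choice to fall back to positional chunks for the whole batch as soon as any sample lacks a group id is an assumption about how the fallback is applied.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeSample:
    """Hypothetical stand-in for slime's Sample; only the fields used here."""
    reward: float
    group_index: Optional[int] = None

def group_sample_indices(samples, n_samples_per_prompt):
    """Return one list of batch positions per normalization group."""
    if any(s.group_index is None for s in samples):
        # Assumed fallback: positional chunks; the last chunk may be shorter.
        return [
            list(range(start, min(start + n_samples_per_prompt, len(samples))))
            for start in range(0, len(samples), n_samples_per_prompt)
        ]
    groups = {}  # dicts preserve first-seen order of group ids
    for pos, sample in enumerate(samples):
        groups.setdefault(sample.group_index, []).append(pos)
    return list(groups.values())

with_ids = [FakeSample(1.0, 0), FakeSample(3.0, 0), FakeSample(5.0, 1)]
interleaved = [FakeSample(1.0, 0), FakeSample(5.0, 1), FakeSample(3.0, 0)]
without_ids = [FakeSample(1.0), FakeSample(3.0), FakeSample(5.0)]

print(group_sample_indices(with_ids, 2))     # [[0, 1], [2]]
print(group_sample_indices(interleaved, 2))  # [[0, 2], [1]]
print(group_sample_indices(without_ids, 2))  # [[0, 1], [2]]
```

Grouping by id rather than by position means interleaved samples from the same prompt still share statistics, which positional chunking alone would not guarantee.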

Tests and checks

PYTHONPATH=${REPAIR_REPO} python -m pytest -q tests/test_rollout_reward_normalization.py
PYTHONPATH=${REPAIR_REPO} python -m ruff check slime/ray/rollout.py tests/test_rollout_reward_normalization.py
python -m pre_commit run --files slime/ray/rollout.py tests/test_rollout_reward_normalization.py

Results:

tests/test_rollout_reward_normalization.py . [100%]
All checks passed!
pre-commit: ruff check, autoflake, isort, and black passed

Contribution and duplicate checks

Upstream repo: THUDM/slime.

Contribution file read: CONTRIBUTING.md. It welcomes bug fixes with clear reproduction and tests, which matches this change.

Duplicate checks performed:

  • BUG_FINDINGS.md searched for RolloutManager, _post_process_rewards, uneven reward groups, singleton reward normalization, and reward-group contamination.
  • RL-Sentinel bug DB and historical loop artifacts searched, excluding the invalid loop root ${RL_SENTINEL_ARTIFACTS_ROOT}/testing-loop/slime-watch-20260518-002939-2.
  • Local PR drafts under pr_drafts/ searched.
  • Local and remote repair branches checked; the only local fix branch was fix/gspo-masked-old-logprob-nan, which fixes a different GSPO masked old-logprob NaN invariant.
  • Upstream PR search through gh was unavailable in this environment, so no fresh upstream PR result is claimed.
  • Upstream main still contained the rewards.view(-1, rewards.shape[-1]) fallback before this patch.

Related PRs or fixes

Related same-family findings in the local ledger cover reward contamination in other projects and masked invalid-value bugs. They are not exact duplicates because this issue is specific to slime RolloutManager._post_process_rewards mixing uneven rollout reward groups before training conversion. The prior branch fix/gspo-masked-old-logprob-nan is also not a duplicate; it addresses masked old logprobs in GSPO KL calculation rather than group reward normalization.
