fix: avoid applying rollout temperature to critic values #1928

Open
Baiyu-Su wants to merge 1 commit into THUDM:main from Baiyu-Su:fix-value-temperature-scaling
@Baiyu-Su

What

Avoid applying rollout_temperature to critic value-head outputs.

get_responses() is shared by response-aligned policy-logit extraction and value extraction. Temperature scaling is needed for policy logits when reconstructing rollout log probabilities, but critic values are scalar predictions and should not be divided by the rollout sampling temperature.
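
To illustrate the distinction, here is a minimal pure-Python sketch (not code from this PR; the helper names `log_softmax` and `rollout_log_probs` are hypothetical). It assumes rollout sampling draws from `softmax(logits / temperature)`, so reconstructing rollout log probabilities requires the same division, whereas dividing a scalar value prediction by the temperature only rescales it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def rollout_log_probs(logits, temperature):
    """Log probabilities of the distribution sampled at the given temperature."""
    return log_softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]
temperature = 0.5  # illustrative non-unit rollout temperature

# Policy path: the temperature changes the distribution, so it must be applied.
lp_scaled = rollout_log_probs(logits, temperature)
lp_unscaled = rollout_log_probs(logits, 1.0)

# Critic path: a value is a scalar prediction, not a logit over tokens.
value = 3.0
wrong_value = value / temperature  # doubles the prediction when temperature is 0.5
```

With a temperature below 1 the scaled distribution is sharper than the unscaled one, while the critic value is simply multiplied by `1 / temperature`, which would bias any return or advantage estimate computed from it.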

Changes

  • Add an apply_temperature flag to get_responses().
  • Keep temperature scaling enabled by default for existing policy/logprob paths.
  • Disable temperature scaling in get_values().
  • Add a zero-GPU unit test for non-unit rollout_temperature.
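
The shape of the change can be sketched as follows. This is a simplified stand-in, not the code in `slime/backends/megatron_utils/loss.py`: the `_sketch` function names, the list-based inputs, and the parameter order are assumptions made for illustration, and only the `apply_temperature` flag and its default come from the description above.

```python
def get_responses_sketch(outputs, response_lengths, rollout_temperature,
                         apply_temperature=True):
    """Yield the response-aligned tail of each sequence.

    Hypothetical simplification: `outputs` is a list of per-sample lists of
    floats, and the response occupies the last `n` positions of each sample.
    """
    for seq, n in zip(outputs, response_lengths):
        chunk = seq[len(seq) - n:]
        if apply_temperature:
            # Policy/logprob paths keep the existing scaling by default.
            chunk = [x / rollout_temperature for x in chunk]
        yield chunk


def get_values_sketch(values, response_lengths, rollout_temperature):
    """Critic path: extract values without temperature scaling."""
    return list(
        get_responses_sketch(
            values, response_lengths, rollout_temperature,
            apply_temperature=False,
        )
    )


# A check in the spirit of the zero-GPU test, with a non-unit temperature.
values = [[1.0, 2.0, 3.0, 4.0]]
extracted = get_values_sketch(values, [2], rollout_temperature=0.5)
scaled = list(get_responses_sketch(values, [2], rollout_temperature=0.5))
```

Defaulting the flag to `True` keeps existing callers unchanged, so only the value-extraction call site needs to opt out.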

Tested

  • ruff check slime/backends/megatron_utils/loss.py tests/test_value_temperature.py
  • isort --profile=black --filter-files --check-only slime/backends/megatron_utils/loss.py tests/test_value_temperature.py
  • black --check slime/backends/megatron_utils/loss.py tests/test_value_temperature.py
  • PYTHONDONTWRITEBYTECODE=1 python -m py_compile slime/backends/megatron_utils/loss.py tests/test_value_temperature.py
