Conversation

@kaysonyu (Contributor) commented Jan 9, 2026

Summary

Fixes #1370: apply the loss mask to the KL divergence before computing returns in get_reinforce_plus_plus_returns.

Problem

In REINFORCE++ advantage estimation, the per-token KL divergence was not masked before the returns were computed. As a result, environment-generated tokens (e.g., tool_response tokens in agentic RL scenarios) incorrectly contributed to the KL penalty, and that penalty then propagated backward through the cumulative return calculation.
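
For concreteness, here is a minimal sketch of the leakage (illustrative values, not verl's actual code; kl_coef, gamma, and the KL tensor are assumptions):

  # One unmasked KL spike at an environment token (index 2) shifts every
  # earlier return through the right-to-left discounted sum.
  import torch

  kl_coef, gamma = 0.05, 1.0
  kl = torch.tensor([0.1, 0.1, 0.5, 0.1])  # index 2: a tool_response token
  token_level_rewards = -kl_coef * kl

  # REINFORCE++-style returns: R_t = r_t + gamma * R_{t+1}
  returns = torch.zeros_like(token_level_rewards)
  running = torch.tensor(0.0)
  for t in reversed(range(kl.shape[0])):
      running = token_level_rewards[t] + gamma * running
      returns[t] = running

  print(returns)  # the KL at t=2 inflates the penalty at t=0, 1, and 2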

Solution

Apply loss_mask to zero out KL values at masked positions before computing returns:

  # Before:
  token_level_rewards = -kl_coef * full_kl_response

  # After:
  masked_kl = full_kl_response * full_mask
  token_level_rewards = -kl_coef * masked_kl
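
Below is a runnable sketch of the fixed path, assuming batch-first (batch, seq_len) tensors; masked_reinforce_pp_returns is a hypothetical stand-in for get_reinforce_plus_plus_returns, and the kl_coef/gamma defaults are illustrative:

  import torch

  def masked_reinforce_pp_returns(full_kl_response, full_mask,
                                  kl_coef=0.05, gamma=1.0):
      # Zero the KL at masked positions (e.g., tool_response tokens) so
      # they contribute no penalty to the cumulative return.
      masked_kl = full_kl_response * full_mask
      token_level_rewards = -kl_coef * masked_kl

      # Discounted cumulative sum, right to left: R_t = r_t + gamma * R_{t+1}
      returns = torch.zeros_like(token_level_rewards)
      running = torch.zeros(token_level_rewards.shape[0])
      for t in reversed(range(token_level_rewards.shape[1])):
          running = token_level_rewards[:, t] + gamma * running
          returns[:, t] = running
      return returns

  kl = torch.tensor([[0.1, 0.1, 0.5, 0.1]])
  mask = torch.tensor([[1.0, 1.0, 0.0, 1.0]])  # position 2: tool_response
  print(masked_reinforce_pp_returns(kl, mask))  # no leakage from position 2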

Development

Successfully merging this pull request may close these issues.

【BUG】no masking for tool_response in token_level_rewards when applied to agentic rl
