Conversation

@kaysonyu (Contributor) commented Jan 9, 2026

Summary

Fixes #1370: apply the loss mask to the KL divergence before computing returns in get_reinforce_plus_plus_returns.

Problem

In REINFORCE++ advantage estimation, the per-token KL divergence was not masked before the returns were computed. As a result, environment-generated tokens (e.g., tool_response tokens in agentic RL scenarios) incorrectly contributed to the KL penalty, and that penalty then propagated backward through the cumulative return calculation.
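
For concreteness, here is a minimal sketch of the leakage (illustrative values, not verl's actual code; kl_coef, gamma, and the KL tensor are assumptions):

  # One unmasked KL spike at an environment token (index 2) shifts every
  # earlier return through the right-to-left discounted sum.
  import torch

  kl_coef, gamma = 0.05, 1.0
  kl = torch.tensor([0.1, 0.1, 0.5, 0.1])  # index 2: a tool_response token
  token_level_rewards = -kl_coef * kl

  # REINFORCE++-style returns: R_t = r_t + gamma * R_{t+1}
  returns = torch.zeros_like(token_level_rewards)
  running = torch.tensor(0.0)
  for t in reversed(range(kl.shape[0])):
      running = token_level_rewards[t] + gamma * running
      returns[t] = running

  print(returns)  # the KL at t=2 inflates the penalty at t=0, 1, and 2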

Solution

Apply loss_mask to zero out KL values at masked positions before computing returns:

  # Before:
  token_level_rewards = -kl_coef * full_kl_response

  # After:
  masked_kl = full_kl_response * full_mask
  token_level_rewards = -kl_coef * masked_kl
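
Below is a runnable sketch of the fixed path, assuming batch-first (batch, seq_len) tensors; masked_reinforce_pp_returns is a hypothetical stand-in for get_reinforce_plus_plus_returns, and the kl_coef/gamma defaults are illustrative:

  import torch

  def masked_reinforce_pp_returns(full_kl_response, full_mask,
                                  kl_coef=0.05, gamma=1.0):
      # Zero the KL at masked positions (e.g., tool_response tokens) so
      # they contribute no penalty to the cumulative return.
      masked_kl = full_kl_response * full_mask
      token_level_rewards = -kl_coef * masked_kl

      # Discounted cumulative sum, right to left: R_t = r_t + gamma * R_{t+1}
      returns = torch.zeros_like(token_level_rewards)
      running = torch.zeros(token_level_rewards.shape[0])
      for t in reversed(range(token_level_rewards.shape[1])):
          running = token_level_rewards[:, t] + gamma * running
          returns[:, t] = running
      return returns

  kl = torch.tensor([[0.1, 0.1, 0.5, 0.1]])
  mask = torch.tensor([[1.0, 1.0, 0.0, 1.0]])  # position 2: tool_response
  print(masked_reinforce_pp_returns(kl, mask))  # no leakage from position 2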

Development

Successfully merging this pull request may close these issues.

【BUG】no masking for tool_response in token_level_rewards when applied to agentic rl
