Conversation

@BounharAbdelaziz (Contributor) commented on Nov 28, 2025:

What does this PR do?

Implements Soft Adaptive Policy Optimization (SAPO) for RL fine-tuning of LLMs. SAPO replaces hard clipping with temperature-controlled soft gating for more stable training and better sample efficiency. Paper: https://arxiv.org/abs/2511.20347

Checklist Before Starting

  • Search for similar PRs: SAPO search
  • Format PR title: [algo] feat: implement SAPO (Soft Adaptive Policy Optimization)

Design & Code Changes

  • Added compute_policy_loss_sapo() in verl/trainer/ppo/core_algos.py
  • Implements the soft gate f(r) = σ(τ·(r − 1)) · 4/τ, where r = π_θ / π_θ_old (a minimal sketch is given after this list)
  • Uses asymmetric temperatures (τ_neg > τ_pos), as in the original paper
  • Aggregation: seq-mean-token-mean, as in the paper

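For reference, here is a minimal, self-contained sketch of a SAPO-style loss following the description above. It assumes the gate f(r) = σ(τ·(r − 1)) · 4/τ, that τ_neg is applied to tokens with negative advantages, and seq-mean-token-mean aggregation; the function name and signature are illustrative and do not mirror verl's actual compute_policy_loss_sapo() API.

```python
# Sketch of a SAPO-style policy loss (not verl's implementation).
# Assumptions: gate f(r) = sigmoid(tau * (r - 1)) * 4 / tau, tau_neg used for
# tokens with negative advantage, and seq-mean-token-mean aggregation.
import torch


def sapo_policy_loss_sketch(log_prob, old_log_prob, advantages, response_mask,
                            tau_pos: float = 1.0, tau_neg: float = 1.05):
    """All tensors are (batch, response_len); response_mask is 1 for valid tokens."""
    assert tau_pos > 0 and tau_neg > 0, "temperatures must be positive (avoids division by zero)"
    ratio = torch.exp(log_prob - old_log_prob)  # r = pi_theta / pi_theta_old, per token
    # Asymmetric temperatures: a steeper gate for negatively-advantaged tokens (assumption).
    tau = torch.where(advantages < 0,
                      torch.full_like(ratio, tau_neg),
                      torch.full_like(ratio, tau_pos))
    gate = torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau  # soft replacement for hard clipping
    pg_losses = -advantages * gate
    # seq-mean-token-mean: average over valid tokens within each sequence, then over sequences.
    token_counts = response_mask.sum(dim=-1).clamp(min=1)
    seq_loss = (pg_losses * response_mask).sum(dim=-1) / token_counts
    return seq_loss.mean()
```

Note that with this form f′(1) = 1 for any τ, so the gradient near r = 1 matches the unclipped PPO surrogate while large ratio deviations are smoothly damped rather than hard-clipped. The rule for choosing between τ_pos and τ_neg is an assumption in this sketch; consult the paper for the exact definition.
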
Checklist Before Submitting

  • Read Contribute Guide
  • Apply pre-commit checks
  • Add/update documentation
  • Add unit tests (if feasible)
  • Request CI in Slack

Experiments: SAPO vs GRPO

  • Setup: Qwen3-4B-Base, total batch size 256, mini-batch size 32, context len 8192.
  • Stability: SAPO training is more stable than GRPO.
  • Response length: SAPO produces substantially longer responses on average.
  • Entropy / collapse: GRPO collapses quickly and its performance saturates early, while SAPO maintains healthier exploration (as seen in the entropy curves).
  • Gradient norm: SAPO runs with slightly higher, but well-behaved, gradient norms.
  • PG loss: The PG loss under SAPO stays around −0.01, i.e., the underlying SAPO objective (before the leading minus sign) is positive, indicating the model consistently discovers and reinforces high-reward behaviors.
(Two screenshots attached, captured 2025-12-09.)

@BounharAbdelaziz changed the title from "implemented SAPO algo by Qwen" to "[algo] SAPO algo by Qwen" on Nov 28, 2025.
@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces the Soft Adaptive Policy Optimization (SAPO) algorithm by adding a new policy loss function and its configuration. The implementation is mostly correct, but I've found a few critical issues that need to be addressed. Firstly, the new configuration parameters tau_pos and tau_neg are missing from the ActorConfig dataclass, which will cause a crash on startup. Secondly, the loss_agg_mode in the new SAPO loss function is hardcoded, making it non-configurable. Lastly, there's a potential for a division-by-zero error in the gating function that could lead to training failure. Please address these points to ensure the stability and correctness of the new algorithm.

Comment on lines +44 to +47
# Positive and negative tau for smoothing function in SAPO (https://arxiv.org/pdf/2511.20347)
# default values used in the paper with Qwen3-30B-A3B-Base
tau_pos: 1.0
tau_neg: 1.05

critical

The new configuration parameters tau_pos and tau_neg are not defined in the ActorConfig dataclass located in verl/workers/config/actor.py. This will cause a ValidationError when Hydra/OmegaConf attempts to instantiate the ActorConfig from this YAML file, as these are unrecognized fields. To fix this, you need to add these fields to the ActorConfig dataclass definition.
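For illustration only, the missing fields could be declared roughly as below (names and defaults taken from the YAML snippet above; the rest of ActorConfig is omitted, so this is a sketch rather than verl's actual class).

```python
# Sketch only: the two new fields added to the existing ActorConfig dataclass
# in verl/workers/config/actor.py (other fields omitted for brevity).
from dataclasses import dataclass


@dataclass
class ActorConfig:
    # ... existing ActorConfig fields ...
    # SAPO soft-gate temperatures (defaults from the paper, per the YAML above)
    tau_pos: float = 1.0
    tau_neg: float = 1.05
```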


# for SAPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
pg_loss = agg_loss(
    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info

critical

The loss_agg_mode is hardcoded as "seq-mean-token-mean" in the call to agg_loss. This completely ignores the loss_agg_mode parameter passed to the compute_policy_loss_sapo function, making this aspect of the loss calculation non-configurable. The function should use the value from the loss_agg_mode argument to allow for flexibility.

Suggested change:
- loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
+ loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode, **config.global_batch_info


@vermouth1992 (Collaborator) commented:

Could you please add an example script in the example folder? Thanks

@vermouth1992 (Collaborator) commented:

Also, if possible, show a convergence curve of some task

@BounharAbdelaziz (Contributor, author) commented:

> Could you please add an example script in the example folder? Thanks

@vermouth1992 done!

@BounharAbdelaziz (Contributor, author) commented:

> Also, if possible, show a convergence curve of some task

@vermouth1992 two experiments are currently running on DAPO, comparing vanilla GRPO and SAPO (I'll probably also add one run with GSPO).

@vermouth1992 (Collaborator) commented:

Could you please fix sanity and precommit ci? Thanks.

@BounharAbdelaziz (Contributor, author) commented:

> Could you please fix sanity and precommit ci? Thanks.

@vermouth1992 done!
The issue was with the new tau params that needed to be added. Sorry for the delay.
