[algo] SAPO algo by Qwen #4345
base: main
Conversation
Code Review
This pull request introduces the Soft Adaptive Policy Optimization (SAPO) algorithm by adding a new policy loss function and its configuration. The implementation is mostly correct, but I've found a few critical issues that need to be addressed. Firstly, the new configuration parameters tau_pos and tau_neg are missing from the ActorConfig dataclass, which will cause a crash on startup. Secondly, the loss_agg_mode in the new SAPO loss function is hardcoded, making it non-configurable. Lastly, there's a potential for a division-by-zero error in the gating function that could lead to training failure. Please address these points to ensure the stability and correctness of the new algorithm.
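On the third point, a minimal sketch of the kind of guard the reviewer is likely asking for, assuming the gate f(r) = σ(τ(r − 1)) · 4/τ quoted in the PR description; the names below are illustrative and do not reproduce the PR's actual code:

```python
import torch

def soft_gate(ratio: torch.Tensor, tau: float, eps: float = 1e-8) -> torch.Tensor:
    """SAPO-style soft gate f(r) = sigmoid(tau * (r - 1)) * 4 / tau.

    Clamping tau away from zero guards against the division-by-zero
    failure mode flagged above if tau is ever configured as 0.
    """
    tau = max(tau, eps)
    return torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau
```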
# Positive and negative tau for smoothing function in SAPO (https://arxiv.org/pdf/2511.20347)
# default values used in the paper with Qwen3-30B-A3B-Base
tau_pos: 1.0
tau_neg: 1.05
The new configuration parameters tau_pos and tau_neg are not defined in the ActorConfig dataclass located in verl/workers/config/actor.py. This will cause a ValidationError when Hydra/OmegaConf attempts to instantiate the ActorConfig from this YAML file, as these are unrecognized fields. To fix this, you need to add these fields to the ActorConfig dataclass definition.
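A minimal sketch of the suggested fix, assuming ActorConfig is a plain dataclass; the surrounding fields and any base class in verl/workers/config/actor.py may differ:

```python
from dataclasses import dataclass

@dataclass
class ActorConfig:
    # ... existing fields ...

    # SAPO smoothing temperatures; defaults mirror the YAML above
    tau_pos: float = 1.0
    tau_neg: float = 1.05
```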
# for SAPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
pg_loss = agg_loss(
    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
The loss_agg_mode is hardcoded as "seq-mean-token-mean" in the call to agg_loss. This completely ignores the loss_agg_mode parameter passed to the compute_policy_loss_sapo function, making this aspect of the loss calculation non-configurable. The function should use the value from the loss_agg_mode argument to allow for flexibility.
Suggested change:
-     loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
+     loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode, **config.global_batch_info
Could you please add an example script in the example folder? Thanks
Also, if possible, please show a convergence curve for some task.
Force-pushed from a4293b1 to 545f1bc
@vermouth1992 done!
@vermouth1992 two experiments are running right now on DAPO, comparing vanilla GRPO and SAPO (I'll probably also add one run with GSPO).
Could you please fix the sanity and pre-commit CI checks? Thanks.
@vermouth1992 done!
What does this PR do?
Implements Soft Adaptive Policy Optimization (SAPO) for RL fine-tuning of LLMs. SAPO replaces hard clipping with temperature-controlled soft gating for more stable training and better sample efficiency. Paper: https://arxiv.org/abs/2511.20347
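For reviewers who want a feel for the algorithm before reading the diff, below is a minimal, self-contained sketch of a SAPO-style policy loss built from the formula quoted in the design notes further down; it is illustrative only and does not reproduce the exact signature of compute_policy_loss_sapo():

```python
import torch

def sapo_loss_sketch(log_prob, old_log_prob, advantages, response_mask,
                     tau_pos: float = 1.0, tau_neg: float = 1.05):
    """Illustrative SAPO-style loss: soft sigmoid gate instead of hard PPO clipping."""
    ratio = torch.exp(log_prob - old_log_prob)  # r = pi_theta / pi_theta_old
    # Asymmetric temperatures: a steeper gate (tau_neg > tau_pos) on negative advantages
    tau = torch.where(advantages > 0,
                      torch.full_like(ratio, tau_pos),
                      torch.full_like(ratio, tau_neg))
    # Soft gate f(r) = sigmoid(tau * (r - 1)) * 4 / tau, replacing clip(r, 1 - eps, 1 + eps)
    gate = torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau
    pg_losses = -advantages * gate
    # seq-mean-token-mean aggregation: token mean per sequence, then mean over sequences
    token_counts = response_mask.sum(dim=-1).clamp(min=1)
    seq_losses = (pg_losses * response_mask).sum(dim=-1) / token_counts
    return seq_losses.mean()
```

Note that the 4/τ factor normalizes the gate so its slope at r = 1 equals 1, i.e. near the on-policy point it behaves like the unclipped ratio while smoothly damping large deviations.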
Checklist Before Starting
[algo] feat: implement SAPO (Soft Adaptive Policy Optimization)

Design & Code Changes

- Adds compute_policy_loss_sapo() in verl/trainer/ppo/core_algos.py
- Soft gate f(r) = σ(τ(r − 1)) · 4/τ, where r = π_θ / π_θ_old
- Asymmetric temperatures (τ_neg > τ_pos), as in the original paper
- Loss aggregated with seq-mean-token-mean, as per the paper

Checklist Before Submitting
Experiments: SAPO vs GRPO