Conversation

@BounharAbdelaziz (Contributor) commented on Nov 28, 2025:

What does this PR do?

Implements Soft Adaptive Policy Optimization (SAPO) for RL fine-tuning of LLMs. SAPO replaces hard clipping with temperature-controlled soft gating for more stable training and better sample efficiency. Paper: https://arxiv.org/abs/2511.20347

Checklist Before Starting

  • Search for similar PRs: SAPO search
  • Format PR title: [algo] feat: implement SAPO (Soft Adaptive Policy Optimization)

Design & Code Changes

  • Added compute_policy_loss_sapo() in verl/trainer/ppo/core_algos.py
  • Implements the soft gate f(r) = σ(τ·(r − 1)) · 4/τ, where r = π_θ / π_θ_old (a minimal sketch is given after this list)
  • Uses asymmetric temperatures (τ_neg > τ_pos), as in the original paper
  • Aggregation: seq-mean-token-mean, as in the paper

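For reference, here is a minimal, self-contained sketch of a SAPO-style loss following the description above. It assumes the gate f(r) = σ(τ·(r − 1)) · 4/τ, that τ_neg is applied to tokens with negative advantages, and seq-mean-token-mean aggregation; the function name and signature are illustrative and do not mirror verl's actual compute_policy_loss_sapo() API.

```python
# Sketch of a SAPO-style policy loss (not verl's implementation).
# Assumptions: gate f(r) = sigmoid(tau * (r - 1)) * 4 / tau, tau_neg used for
# tokens with negative advantage, and seq-mean-token-mean aggregation.
import torch


def sapo_policy_loss_sketch(log_prob, old_log_prob, advantages, response_mask,
                            tau_pos: float = 1.0, tau_neg: float = 1.05):
    """All tensors are (batch, response_len); response_mask is 1 for valid tokens."""
    assert tau_pos > 0 and tau_neg > 0, "temperatures must be positive (avoids division by zero)"
    ratio = torch.exp(log_prob - old_log_prob)  # r = pi_theta / pi_theta_old, per token
    # Asymmetric temperatures: a steeper gate for negatively-advantaged tokens (assumption).
    tau = torch.where(advantages < 0,
                      torch.full_like(ratio, tau_neg),
                      torch.full_like(ratio, tau_pos))
    gate = torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau  # soft replacement for hard clipping
    pg_losses = -advantages * gate
    # seq-mean-token-mean: average over valid tokens within each sequence, then over sequences.
    token_counts = response_mask.sum(dim=-1).clamp(min=1)
    seq_loss = (pg_losses * response_mask).sum(dim=-1) / token_counts
    return seq_loss.mean()
```

Note that with this form f′(1) = 1 for any τ, so the gradient near r = 1 matches the unclipped PPO surrogate while large ratio deviations are smoothly damped rather than hard-clipped. The rule for choosing between τ_pos and τ_neg is an assumption in this sketch; consult the paper for the exact definition.
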
Checklist Before Submitting

  • Read Contribute Guide
  • Apply pre-commit checks
  • Add/update documentation
  • Add unit tests (if feasible)
  • Request CI in Slack

Experiments: SAPO vs GRPO

  • Setup: Qwen3-4B-Base, total batch size 256, mini-batch size 32, context len 8192.
  • Stability: SAPO training is more stable than GRPO.
  • Response length: SAPO produces substantially longer responses on average.
  • Entropy / collapse: GRPO collapses quickly and its performance saturates early, while SAPO maintains healthier exploration (as seen in the entropy curves).
  • Gradient norm: SAPO runs with slightly higher, but well-behaved, gradient norms.
  • PG loss: The PG loss under SAPO stays around −0.01, i.e., the underlying SAPO objective (before the leading minus sign) is positive, indicating the model consistently discovers and reinforces high-reward behaviors.
(Two screenshots attached, captured 2025-12-09.)

@BounharAbdelaziz changed the title from "implemented SAPO algo by Qwen" to "[algo] SAPO algo by Qwen" on Nov 28, 2025.
@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces the Soft Adaptive Policy Optimization (SAPO) algorithm by adding a new policy loss function and its configuration. The implementation is mostly correct, but I've found a few critical issues that need to be addressed. Firstly, the new configuration parameters tau_pos and tau_neg are missing from the ActorConfig dataclass, which will cause a crash on startup. Secondly, the loss_agg_mode in the new SAPO loss function is hardcoded, making it non-configurable. Lastly, there's a potential for a division-by-zero error in the gating function that could lead to training failure. Please address these points to ensure the stability and correctness of the new algorithm.

Comment on lines +44 to +47
# Positive and negative tau for smoothing function in SAPO (https://arxiv.org/pdf/2511.20347)
# default values used in the paper with Qwen3-30B-A3B-Base
tau_pos: 1.0
tau_neg: 1.05

critical

The new configuration parameters tau_pos and tau_neg are not defined in the ActorConfig dataclass located in verl/workers/config/actor.py. This will cause a ValidationError when Hydra/OmegaConf attempts to instantiate the ActorConfig from this YAML file, as these are unrecognized fields. To fix this, you need to add these fields to the ActorConfig dataclass definition.
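For illustration only, the missing fields could be declared roughly as below (names and defaults taken from the YAML snippet above; the rest of ActorConfig is omitted, so this is a sketch rather than verl's actual class).

```python
# Sketch only: the two new fields added to the existing ActorConfig dataclass
# in verl/workers/config/actor.py (other fields omitted for brevity).
from dataclasses import dataclass


@dataclass
class ActorConfig:
    # ... existing ActorConfig fields ...
    # SAPO soft-gate temperatures (defaults from the paper, per the YAML above)
    tau_pos: float = 1.0
    tau_neg: float = 1.05
```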


# for SAPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
pg_loss = agg_loss(
    loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info

critical

The loss_agg_mode is hardcoded as "seq-mean-token-mean" in the call to agg_loss. This completely ignores the loss_agg_mode parameter passed to the compute_policy_loss_sapo function, making this aspect of the loss calculation non-configurable. The function should use the value from the loss_agg_mode argument to allow for flexibility.

Suggested change:
- loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean", **config.global_batch_info
+ loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode, **config.global_batch_info


@vermouth1992 (Collaborator) commented:

Could you please add an example script in the example folder? Thanks

@vermouth1992 (Collaborator) commented:

Also, if possible, show a convergence curve of some task

@BounharAbdelaziz (Contributor, author) commented:

> Could you please add an example script in the example folder? Thanks

@vermouth1992 done!

@BounharAbdelaziz (Contributor, author) commented:

> Also, if possible, show a convergence curve of some task

@vermouth1992 two experiments are currently running on DAPO, comparing vanilla GRPO and SAPO (I'll probably also add one run with GSPO).

@vermouth1992 (Collaborator) commented:

Could you please fix sanity and precommit ci? Thanks.

@BounharAbdelaziz (Contributor, author) commented:

> Could you please fix sanity and precommit ci? Thanks.

@vermouth1992 done!
The issue was with the new tau params that needed to be added. Sorry for the delay.
