feat: add SAPO loss variant by pengdurice · Pull Request #2464 · NVIDIA-NeMo/RL

pengdurice · 2026-05-11T20:18:15Z

What does this PR do?

Adds the SAPO loss as a GRPO loss variant, together with unit tests and a three-arm reproduction recipe.

Issues

Usage

In your GRPO YAML config file, add the following:

loss_fn:
  sapo_enabled: true
  sapo_tau_pos: 1.0     # default
  sapo_tau_neg: 1.05
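For orientation, the soft gate that `sapo_tau_pos` and `sapo_tau_neg` control can be sketched in plain Python. This is a minimal sketch that assumes the gate form f(r) = (4/τ)·σ(τ(r − 1)) as used in the TRL reference cited in the checklist; the names `sapo_gate` and `sapo_token_loss` are illustrative and are not the identifiers used in this PR.

```python
import math


def sapo_gate(ratio: float, tau: float) -> float:
    """Soft gate (4 / tau) * sigmoid(tau * (ratio - 1)).

    The slope at ratio == 1 is exactly 1, so near on-policy the gradient
    matches the unclipped surrogate. Far from 1 the sigmoid saturates and
    the gradient decays smoothly instead of being cut off by a hard clip.
    """
    return 4.0 / tau / (1.0 + math.exp(-tau * (ratio - 1.0)))


def sapo_token_loss(ratio: float, advantage: float,
                    tau_pos: float = 1.0, tau_neg: float = 1.05) -> float:
    """Per-token loss; the temperature is chosen by the sign of the advantage.

    Assumption: tokens with advantage <= 0 use tau_neg, as in the TRL code.
    """
    tau = tau_pos if advantage > 0 else tau_neg
    return -sapo_gate(ratio, tau) * advantage


if __name__ == "__main__":
    for r in (0.5, 1.0, 1.5, 3.0):
        print(r, sapo_gate(r, 1.0), sapo_gate(r, 1.05))
```

A larger `tau_neg` makes the gate saturate faster for negative-advantage tokens, so their gradients are damped more aggressively than those of positive-advantage tokens.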

Reproduction recipe — Qwen3-30B-A3B (Megatron + vLLM, 2×8 H200)

The recipe has three arms that share every hyperparameter except the loss. It follows the existing tests/test_suites/llm/<name>.sh ↔ examples/configs/recipes/llm/<name>.yaml convention.

| Arm | Loss | `(sapo_enabled, τ_pos, τ_neg)` |
| --- | --- | --- |
| `sapo-baseline` | vanilla GRPO | `(False, –, –)` |
| `sapo-asym` | SAPO, paper defaults | `(True, 1.0, 1.05)` |
| `sapo-symm` | SAPO, symmetric τ ablation | `(True, 1.0, 1.0)` |

Files added:

examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-repro-base.yaml   # shared base
examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-baseline.yaml
examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-asym.yaml
examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-symm.yaml
tests/test_suites/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-baseline.sh
tests/test_suites/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-asym.sh
tests/test_suites/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-symm.sh

All three arms inherit: Megatron TP=2 × EP=8, vLLM TP=4 colocated, 32 prompts × 16 generations = 512 trajectories per step, train_global_batch_size=128 → 4 mini-batches per rollout (matches GSPO §5.1 off-policy regime), KL β=0, temperature 1.0, seq=4096.
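The batch arithmetic in the shared settings can be checked in a few lines. This is only a sanity-check sketch: `train_global_batch_size` is the name given above, while the other variable names are chosen here for illustration.

```python
# Shared rollout and training sizes quoted in the recipe description.
num_prompts_per_step = 32
num_generations_per_prompt = 16
train_global_batch_size = 128

# Each rollout step produces one trajectory per (prompt, generation) pair.
trajectories_per_step = num_prompts_per_step * num_generations_per_prompt

# Optimizer updates per rollout; more than one means later updates are
# computed on samples from a slightly stale policy (off-policy regime).
minibatches_per_rollout, remainder = divmod(
    trajectories_per_step, train_global_batch_size
)
assert remainder == 0, "rollout size should be a multiple of the global batch size"

print(trajectories_per_step, minibatches_per_rollout)  # 512 4
```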

Empirical results (200 steps, math env, 2×8 H200)

| Metric | GRPO | SAPO asym | SAPO symm |
| --- | --- | --- | --- |
| Final val accuracy | 0.5312 | 0.5781 (+4.7 pts) | 0.5625 (+3.1 pts) |
| gen_kl drift (steps 1–50 → last 50) | +135 % | flat (0 %) | +8 % |
| Late-training behavior | peaks at step 120, then regresses by 5.5 pts | still climbing at step 180 | uniformly elevated |
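The point deltas in the accuracy row follow directly from the reported values, as the sketch below shows. The `relative_drift` helper is an assumption about how a windowed drift percentage could be computed (mean over the last window relative to mean over the first); the PR does not state the exact formula, and no per-step `gen_kl` values are reproduced here.

```python
# Final validation accuracies as reported in the results table.
final_val_accuracy = {"GRPO": 0.5312, "SAPO asym": 0.5781, "SAPO symm": 0.5625}

baseline = final_val_accuracy["GRPO"]
delta_pts = {
    arm: round((acc - baseline) * 100, 1)  # difference in percentage points
    for arm, acc in final_val_accuracy.items()
}
print(delta_pts)  # {'GRPO': 0.0, 'SAPO asym': 4.7, 'SAPO symm': 3.1}


def relative_drift(early_mean: float, late_mean: float) -> float:
    """Hypothetical drift metric: percent change from an early-window mean
    to a late-window mean. Assumed definition, not taken from the PR."""
    return (late_mean - early_mean) / early_mean * 100.0
```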

These results are consistent with three claims from the paper: (1) SAPO asym reaches the highest final validation accuracy, (2) GRPO's gen_kl drifts upward (off-policy) while SAPO asym's stays flat (on-policy), which is the mechanism the paper proposes, and (3) asymmetric τ outperforms symmetric τ (Fig. 5 ordering). The gains are smaller than in the paper because our setting (Instruct + frozen router) is friendlier to GRPO than the paper's (Base + cold-start SFT, no routing replay).

Checklist

Loss math matches TRL reference + paper Eq. (5)–(6).
Unit tests added (gate math + incompatibility asserts).
Three reproduction arms differ from each other in exactly (sapo_enabled, sapo_tau_pos, sapo_tau_neg); YAML inheritance verified.
End-to-end runs on 2 nodes: SAPO asym beats GRPO on final val by +4.7 pts, GRPO gen_kl drift +135 % vs SAPO flat.
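The "arms differ only in the three SAPO keys" check can be expressed as a diff over flattened configs. The sketch below uses small hand-written dicts as stand-ins for the resolved YAML files (loading and inheritance resolution are out of scope); `flatten` and `differing_keys` are hypothetical helpers, not code from this PR, and the `grpo.num_prompts_per_step` entry is included only as an example of a shared setting.

```python
_MISSING = object()  # sentinel for keys absent from one of the configs


def flatten(cfg: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {'a.b.c': value} form."""
    flat = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def differing_keys(a: dict, b: dict) -> set:
    """Dotted keys whose values differ, or that exist in only one config."""
    fa, fb = flatten(a), flatten(b)
    return {
        k for k in fa.keys() | fb.keys()
        if fa.get(k, _MISSING) != fb.get(k, _MISSING)
    }


# Toy stand-ins for the three resolved arm configs.
shared = {"grpo": {"num_prompts_per_step": 32}}
baseline = {**shared, "loss_fn": {"sapo_enabled": False}}
asym = {**shared, "loss_fn": {"sapo_enabled": True, "sapo_tau_pos": 1.0, "sapo_tau_neg": 1.05}}
symm = {**shared, "loss_fn": {"sapo_enabled": True, "sapo_tau_pos": 1.0, "sapo_tau_neg": 1.0}}

ALLOWED = {"loss_fn.sapo_enabled", "loss_fn.sapo_tau_pos", "loss_fn.sapo_tau_neg"}
for left, right in [(baseline, asym), (baseline, symm), (asym, symm)]:
    assert differing_keys(left, right) <= ALLOWED
```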

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Signed-off-by: pengdurice <pengduhit@gmail.com>

copy-pr-bot · 2026-05-11T20:18:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

add SAPO loss with examples and online training results

b4e0954

Signed-off-by: pengdurice <pengduhit@gmail.com>

pengdurice changed the title ~~add SAPO loss with examples and online training results~~ feat: add SAPO loss with examples and online training results May 11, 2026

github-actions Bot added the community-request label May 11, 2026

pengdurice changed the title ~~feat: add SAPO loss with examples and online training results~~ feat: add SAPO loss variant May 11, 2026

pengdurice wants to merge 1 commit into NVIDIA-NeMo:main from pengdurice:peng-sapo-v1
