
feat: add SAPO loss variant#2464

Draft
pengdurice wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
pengdurice:peng-sapo-v1

Conversation

@pengdurice
Contributor

What does this PR do ?

Add SAPO loss and test cases

Issues

#1677

Usage

In your GRPO yaml file, add the following

loss_fn:
  sapo_enabled: true
  sapo_tau_pos: 1.0     # default
  sapo_tau_neg: 1.05 
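For intuition about what the two temperatures control, here is a minimal sketch of an advantage-sign-dependent soft gate. This is illustrative only — the exact SAPO gate is defined in the paper and in this PR's implementation; the function name and the particular sigmoid form below are hypothetical:

```python
import math

def soft_gate(ratio: float, advantage: float,
              tau_pos: float = 1.0, tau_neg: float = 1.05) -> float:
    """Hypothetical soft gate: weight a token's surrogate by how far the
    importance ratio pi_new/pi_old has drifted from 1, using a separate
    temperature per advantage sign. Illustrative only, not the PR's code."""
    tau = tau_pos if advantage >= 0 else tau_neg
    # At ratio = 1 the weight is exactly 1; a larger tau flattens the
    # gate, i.e. off-policy tokens are down-weighted more gently.
    return 2.0 / (1.0 + math.exp((ratio - 1.0) / tau))
```

With the defaults above (τ_neg = 1.05 > τ_pos = 1.0), tokens with negative advantage would be gated slightly more gently than tokens with positive advantage as the ratio drifts above 1.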

Reproduction recipe — Qwen3-30B-A3B (Megatron + vLLM, 2×8 H200)

Three arms share every hyperparameter except the loss. Follows the existing tests/test_suites/llm/<name>.sh + examples/configs/recipes/llm/<name>.yaml convention.

| Arm | Loss | (sapo_enabled, τ_pos, τ_neg) |
| --- | --- | --- |
| sapo-baseline | vanilla GRPO | (False, –, –) |
| sapo-asym | SAPO, paper defaults | (True, 1.0, 1.05) |
| sapo-symm | SAPO, symmetric τ ablation | (True, 1.0, 1.0) |

Files added:

examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-repro-base.yaml   # shared base
examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-baseline.yaml
examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-asym.yaml
examples/configs/recipes/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-symm.yaml
tests/test_suites/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-baseline.sh
tests/test_suites/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-asym.sh
tests/test_suites/llm/grpo-qwen3-30ba3b-2n8g-megatron-sapo-symm.sh
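The per-arm YAMLs presumably stay tiny by inheriting the shared base and overriding only the loss knobs. A hypothetical sketch of the asym arm's file (the inheritance mechanism and any field outside loss_fn are illustrative, not taken from the PR):

```yaml
# grpo-qwen3-30ba3b-2n8g-megatron-sapo-asym.yaml (illustrative sketch)
defaults:
  - grpo-qwen3-30ba3b-2n8g-megatron-sapo-repro-base  # shared base config

loss_fn:
  sapo_enabled: true
  sapo_tau_pos: 1.0
  sapo_tau_neg: 1.05
```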

All three arms inherit: Megatron TP=2 × EP=8, vLLM TP=4 colocated, 32 prompts × 16 generations = 512 trajectories per step, train_global_batch_size=128 (4 mini-batches per rollout, matching the GSPO §5.1 off-policy regime), KL β=0, temperature 1.0, seq=4096.
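Assuming the intended global batch is 128 sequences (so the 512 rollout trajectories split evenly), the batching arithmetic works out as:

```python
# Rollout/batching arithmetic for the recipe above (assumes
# train_global_batch_size=128; verify against the YAML).
prompts_per_step = 32
generations_per_prompt = 16
trajectories = prompts_per_step * generations_per_prompt  # 512 per step

train_global_batch_size = 128
mini_batches_per_rollout = trajectories // train_global_batch_size
print(mini_batches_per_rollout)  # 4 gradient steps per rollout
```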

Empirical results (200 steps, math env, 2×8 H200)

| | GRPO | SAPO asym | SAPO symm |
| --- | --- | --- | --- |
| Final val accuracy | 0.5312 | 0.5781 (+4.7) | 0.5625 (+3.1) |
| gen_kl drift (steps 1–50 → last 50) | +135 % | flat (0 %) | +8 % |
| Late-training behavior | peaks at step 120, regresses by 5.5 pts | still climbing at step 180 | uniformly elevated |

Three confirmations of paper claims: (1) SAPO asym wins on validation, (2) GRPO drifts off-policy while SAPO stays on-policy (the mechanism), (3) asymmetric τ > symmetric τ (Fig. 5 ordering). Magnitude is smaller than the paper's because our setting (Instruct + frozen router) is friendlier to GRPO than the paper's (Base + cold-start SFT, no routing replay).

Checklist

  • Loss math matches TRL reference + paper Eq. (5)–(6).
  • Unit tests added (gate math + incompatibility asserts).
  • Three reproduction arms differ from each other in exactly (sapo_enabled, sapo_tau_pos, sapo_tau_neg); YAML inheritance verified.
  • End-to-end runs on 2 nodes: SAPO asym beats GRPO on final val by +4.7 pts, GRPO gen_kl drift +135 % vs SAPO flat.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: pengdurice <pengduhit@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@pengdurice pengdurice changed the title add SAPO loss with examples and online training results feat: add SAPO loss with examples and online training results May 11, 2026
@pengdurice pengdurice changed the title feat: add SAPO loss with examples and online training results feat: add SAPO loss variant May 11, 2026
