feat(grpo): add SAPO actor loss #2455

Open
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-1677-sapo

@taivu1998

Summary

  • Add Soft Adaptive Policy Optimization (SAPO) as a selectable GRPO actor loss via loss_fn.actor_loss_type: "sapo".
  • Keep the existing PPO/GRPO/DAPO/GSPO/REINFORCE behavior under the default actor_loss_type: "ppo_clip".
  • Add SAPO defaults to GRPO exemplar, NeMo-Gym, ModelOpt, and template configs.
  • Document the SAPO config surface in the GRPO guide.
  • Add focused unit coverage for SAPO forward values, metrics, gradients, importance-sampling correction, extreme log-ratio stability, and incompatible config validation.

Closes #1677.

Motivation

Issue #1677 requests support for the SAPO algorithm described in https://arxiv.org/pdf/2511.20347. The current GRPO loss path supports PPO-style clipped objectives and related variants, but it does not expose SAPO's smooth, adaptive actor surrogate.

Implementation

  • Extend ClippedPGLossConfig with actor_loss_type, sapo_tau_pos, sapo_tau_neg, and sapo_log_ratio_clamp_value.
  • Implement the SAPO token-level surrogate (4 / tau) * sigmoid(tau * (r - 1)), where r is the token-level importance ratio and tau is chosen as sapo_tau_pos or sapo_tau_neg according to the advantage sign.
  • Add an optional SAPO-only log-ratio clamp before exponentiation, plus finite ratio handling for numerical guardrails.
  • Preserve the existing KL penalty, rollout/logprob data flow, train/inference importance-sampling correction, metrics, and non-SAPO actor-loss behavior.
  • Reject unsupported SAPO combinations that would silently change the objective semantics, including sequence-level importance ratios, sequence-level loss, disable_ppo_ratio, force_on_policy_ratio, and dual clipping.
  • Log SAPO activation during GRPO setup.
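The surrogate, clamp, and finite-ratio handling described in the list above can be sketched in scalar Python. This is a minimal illustration, not the code in this PR: the function name sapo_token_surrogate, its argument names, and the choice to cap an overflowing ratio at the largest finite float are all assumptions, and the tau values in the usage lines are illustrative rather than the PR defaults. The function returns the per-token objective to maximize; a loss would be its negative.

```python
import math
import sys


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sapo_token_surrogate(log_ratio, advantage, tau_pos, tau_neg,
                         log_ratio_clamp=None):
    """Hypothetical scalar sketch of the SAPO token-level objective.

    log_ratio is log pi_new(token) - log pi_old(token).
    """
    # Optional symmetric clamp on the log-ratio before exponentiation.
    if log_ratio_clamp is not None:
        log_ratio = max(-log_ratio_clamp, min(log_ratio_clamp, log_ratio))

    # Keep the ratio finite even when exp() would overflow (assumed guardrail).
    try:
        ratio = math.exp(log_ratio)
    except OverflowError:
        ratio = sys.float_info.max

    # Temperature is selected by the sign of the advantage.
    tau = tau_pos if advantage > 0 else tau_neg

    # Soft gate: (4 / tau) * sigmoid(tau * (r - 1)).
    gate = (4.0 / tau) * _sigmoid(tau * (ratio - 1.0))
    return gate * advantage


# On-policy token (ratio == 1): the gate equals 2 / tau.
on_policy = sapo_token_surrogate(0.0, 1.0, tau_pos=1.0, tau_neg=1.05)

# An extreme log-ratio stays finite; the gate saturates at 4 / tau.
extreme = sapo_token_surrogate(1000.0, 1.0, tau_pos=1.0, tau_neg=1.05)
```

The 4 / tau scaling makes the gate's slope with respect to r equal to 1 at r = 1, so near the on-policy point the surrogate has the same gradient as the plain ratio r times the advantage, while the sigmoid flattens it smoothly for large deviations.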

Validation

  • Focused SAPO unit suite: tests/unit/algorithms/test_loss_functions.py -k sapo reports 12 passed, 43 deselected. It was run in a no-project harness with a managed Python, because this local machine has a broken /usr/local/bin/python3.13 and the project requires >=3.13.13.
  • ruff check on all touched Python files.
  • ruff format --check on all touched Python files.
  • python -m py_compile on all touched Python files.
  • git diff --check.
  • YAML inheritance scan over all GRPO configs in examples/ and research/template_project that mention ratio_clip_c; all resolve the new SAPO config keys.

Notes

The default remains actor_loss_type: "ppo_clip", so existing configs preserve their current behavior unless SAPO is explicitly enabled.
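The opt-in behavior, together with the rejected combinations listed under Implementation, could be checked along the following lines. This is a hedged sketch over a plain dict, not the validation code in this PR. The keys actor_loss_type, disable_ppo_ratio, force_on_policy_ratio, and ratio_clip_c appear in this description; sequence_level_importance_ratios and token_level_loss are assumed key names, and treating a non-None ratio_clip_c as "dual clipping enabled" is also an assumption. The function name validate_actor_loss_config is hypothetical.

```python
def validate_actor_loss_config(cfg: dict) -> None:
    """Hypothetical check: raise ValueError for unsupported SAPO combinations."""
    # A missing key falls back to the default, so existing configs are unchanged.
    loss_type = cfg.get("actor_loss_type", "ppo_clip")
    if loss_type not in ("ppo_clip", "sapo"):
        raise ValueError(f"unknown actor_loss_type: {loss_type!r}")
    if loss_type != "sapo":
        return

    conflicts = []
    if cfg.get("sequence_level_importance_ratios", False):  # assumed key name
        conflicts.append("sequence-level importance ratios")
    if not cfg.get("token_level_loss", True):  # assumed key name
        conflicts.append("sequence-level loss")
    if cfg.get("disable_ppo_ratio", False):
        conflicts.append("disable_ppo_ratio")
    if cfg.get("force_on_policy_ratio", False):
        conflicts.append("force_on_policy_ratio")
    if cfg.get("ratio_clip_c") is not None:  # assumed to mean dual clipping
        conflicts.append("dual clipping (ratio_clip_c)")

    if conflicts:
        raise ValueError("SAPO is incompatible with: " + ", ".join(conflicts))


# A config without the new key validates as the default ppo_clip path.
validate_actor_loss_config({"ratio_clip_c": 3.0})

# Explicit SAPO with no conflicting options also validates.
validate_actor_loss_config({"actor_loss_type": "sapo", "ratio_clip_c": None})
```

Failing at setup keeps an incompatible option from silently changing the objective during training.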

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Documentation and community-request labels May 10, 2026
@taivu1998 taivu1998 marked this pull request as ready for review May 11, 2026 03:06
@taivu1998 taivu1998 requested review from a team and terrykong as code owners May 11, 2026 03:06
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers label May 12, 2026