feat(grpo): log per-optimizer step metrics#2452

Open
taivu1998 wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
taivu1998:tdv/issue-1435-step-logging
Conversation

@taivu1998
Summary

  • Add explicit optimizer-step counters to GRPO logging so metrics for each PPO optimization step are emitted under train/optim/* together with train/optim_step.
  • Preserve RL-step visibility by adding train/rl_step, train/optim_step, and train/num_optim_steps_per_rl_step to aggregate train logs.
  • Carry per-optimizer-step metrics through DTensor v1, DTensor v2, and Megatron workers, then aggregate them across policy workers by optimizer-step index.
  • Persist total_optim_steps in GRPO checkpoints with fallback inference for older checkpoints.
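The per-optimizer-step logging in the first two bullets can be sketched roughly as follows. This is an illustrative helper, not the PR's actual code; the function name `build_optim_step_logs` and the assumption that every RL step runs the same number of optimizer steps are mine.

```python
from typing import Any


def build_optim_step_logs(
    per_step_metrics: list[dict[str, float]],
    rl_step: int,
) -> list[dict[str, Any]]:
    """Expand one RL step's per-optimizer-step metrics into log records.

    Each record carries its metrics under a ``train/optim/`` prefix plus a
    global ``train/optim_step`` counter, so loggers can plot per optimizer
    step instead of once per RL step. Assumes a constant number of
    optimizer steps per RL step (hypothetical simplification).
    """
    num_steps = len(per_step_metrics)
    logs: list[dict[str, Any]] = []
    for i, metrics in enumerate(per_step_metrics):
        record = {f"train/optim/{k}": v for k, v in metrics.items()}
        record["train/optim_step"] = rl_step * num_steps + i
        record["train/rl_step"] = rl_step
        record["train/num_optim_steps_per_rl_step"] = num_steps
        logs.append(record)
    return logs
```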

Root Cause

GRPO policy training can run multiple optimizer steps inside one RL step, but metrics were only aggregated and logged once per RL step. That hid per-step behavior and made off-policy PPO diagnostics hard to read.
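Aggregating per-step metrics across policy workers by optimizer-step index (as the summary describes) might look like the sketch below. This is a minimal illustration with hypothetical names, assuming all workers run the same number of optimizer steps and report the same metric keys, and that plain averaging is the desired reduction.

```python
def aggregate_by_optim_step(
    worker_metrics: list[list[dict[str, float]]],
) -> list[dict[str, float]]:
    """Average metrics across workers at each optimizer-step index.

    ``worker_metrics[w][s]`` is worker ``w``'s metric dict at optimizer
    step ``s`` within the current RL step. Returns one averaged dict per
    optimizer step, preserving step ordering.
    """
    num_steps = len(worker_metrics[0])
    aggregated: list[dict[str, float]] = []
    for s in range(num_steps):
        step_dicts = [per_worker[s] for per_worker in worker_metrics]
        keys = step_dicts[0].keys()
        aggregated.append(
            {k: sum(d[k] for d in step_dicts) / len(step_dicts) for k in keys}
        )
    return aggregated
```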

Validation

  • uvx ruff check nemo_rl/algorithms/grpo.py nemo_rl/models/policy/lm_policy.py nemo_rl/models/policy/utils.py nemo_rl/models/policy/workers/dtensor_policy_worker.py nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/logger.py tests/unit/algorithms/test_grpo.py tests/unit/models/policy/test_utils.py tests/unit/utils/test_logger.py
  • uvx ruff format --check nemo_rl/algorithms/grpo.py nemo_rl/models/policy/lm_policy.py nemo_rl/models/policy/utils.py nemo_rl/models/policy/workers/dtensor_policy_worker.py nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py nemo_rl/models/policy/workers/megatron_policy_worker.py nemo_rl/utils/logger.py tests/unit/algorithms/test_grpo.py tests/unit/models/policy/test_utils.py tests/unit/utils/test_logger.py
  • Compiled the source of all touched files with /usr/local/bin/python3.10 to verify syntax
  • Direct checks of GRPO helper functions extracted from source
  • /Users/vuductai/Documents/Projects/RL/.venv-dev/bin/python -m pytest tests/unit/models/policy/test_utils.py tests/unit/utils/test_logger.py -q -k "optim_step_metrics or aggregate_metric_dicts or unscale_loss_metrics or step_metric or gpu_monitoring" (13 passed, 69 deselected)
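The checkpoint fallback mentioned in the summary (persisting total_optim_steps with inference for older checkpoints) could be sketched as below. The function and key names are hypothetical, and the fallback assumes a constant number of optimizer steps per RL step, which may be inexact if the schedule ever changed.

```python
def load_total_optim_steps(
    checkpoint: dict,
    num_optim_steps_per_rl_step: int,
) -> int:
    """Read total_optim_steps, inferring it for older checkpoints.

    Older checkpoints predate the counter, so fall back to deriving it
    from the saved RL-step count (assumes a constant optimizer-step
    schedule; hypothetical key names).
    """
    if "total_optim_steps" in checkpoint:
        return checkpoint["total_optim_steps"]
    return checkpoint.get("rl_step", 0) * num_optim_steps_per_rl_step
```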

Local Environment Notes

  • uv run pytest ... is blocked locally by a broken /usr/local/bin/python3.13 install.
  • Focused tests/unit/algorithms/test_grpo.py collection is blocked in the local venv by missing optional soundfile after adding Megatron submodule paths.

Closes #1435.

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>
copy-pr-bot Bot commented May 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@taivu1998 taivu1998 marked this pull request as ready for review May 11, 2026 03:07
@taivu1998 taivu1998 requested review from a team as code owners May 11, 2026 03:07

Successfully merging this pull request may close these issues.

report rl and optimization step differently

2 participants