
feat: add online DPO training #2456

Open
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-1816-online-dpo


@taivu1998

Summary

Adds a first Online DPO training path that samples two policy responses per prompt, scores them through NeMo RL environments, converts the higher/lower reward responses into chosen/rejected preference pairs, and trains with the existing DPO loss.
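The pair-construction step described above can be illustrated with a minimal sketch. This is not the code from nemo_rl/algorithms/online_dpo.py; the `PreferencePair` type, the `build_preference_pair` name, and the choice to skip tied rewards are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    """Hypothetical container for one chosen/rejected preference pair."""

    prompt: str
    chosen: str
    rejected: str
    chosen_reward: float
    rejected_reward: float


def build_preference_pair(
    prompt: str,
    responses: tuple[str, str],
    rewards: tuple[float, float],
) -> Optional[PreferencePair]:
    """Turn two scored responses into a chosen/rejected pair.

    Returns None when the rewards tie. Tie handling is an assumption of
    this sketch; the PR description does not say how ties are treated.
    """
    if len(responses) != 2 or len(rewards) != 2:
        raise ValueError("expected exactly two responses and two rewards")
    if rewards[0] == rewards[1]:
        return None
    # The higher-reward response becomes "chosen", the other "rejected".
    hi, lo = (0, 1) if rewards[0] > rewards[1] else (1, 0)
    return PreferencePair(
        prompt=prompt,
        chosen=responses[hi],
        rejected=responses[lo],
        chosen_reward=rewards[hi],
        rejected_reward=rewards[lo],
    )
```

For example, `build_preference_pair("q", ("a", "b"), (0.2, 0.9))` yields a pair whose `chosen` field is `"b"` and whose `rejected` field is `"a"`.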

Closes #1816.

What Changed

  • Added examples/run_online_dpo.py as the Online DPO entrypoint.
  • Added nemo_rl/algorithms/online_dpo.py with rollout collection, online preference-pair construction, reference logprob calculation, DPO training, validation, checkpointing, and resume state.
  • Added examples/configs/online_dpo.yaml as the exemplar config and source of defaults.
  • Added Online DPO documentation and linked it from the main docs index and algorithm matrix.
  • Added unit coverage for pair construction, environment-observation stripping, DPO collation, and reference-logprob rolling.
  • Added a functional smoke script and wired it into the fast L1 GPU suite.
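For readers less familiar with why reference logprobs are needed before the DPO update, here is a sketch of the standard per-pair DPO loss in plain Python. It is the textbook formulation, not necessarily how the existing NeMo RL DPO loss is implemented; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logprob: float,
    policy_rejected_logprob: float,
    ref_chosen_logprob: float,
    ref_rejected_logprob: float,
    beta: float = 0.1,  # illustrative default, not taken from the PR
) -> float:
    """Standard DPO loss for one pair, from summed sequence log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the frozen reference model.
    margin = (policy_chosen_logprob - ref_chosen_logprob) - (
        policy_rejected_logprob - ref_rejected_logprob
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form is numerically stable.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference agree on both responses the margin is zero and the loss is log 2; the loss falls as the policy favors the chosen response more strongly than the reference does.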

Initial Scope

The implementation is intentionally narrow to keep the first version predictable:

  • text-only LLM prompts
  • one rollout turn
  • exactly two generations per prompt
  • scalar environment rewards
  • single prompt dataloader
  • colocated generation
  • no NeMo-Gym rollouts
  • no dynamic batching or sequence packing

Unsupported surfaces are rejected during setup instead of being silently accepted.
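A fail-fast check of this kind might look like the sketch below. Every config key and the `validate_online_dpo_scope` name are hypothetical, chosen only to mirror the scope list above; the actual keys in examples/configs/online_dpo.yaml may differ.

```python
def validate_online_dpo_scope(cfg: dict) -> None:
    """Raise ValueError naming every unsupported setting found in cfg.

    All keys below are hypothetical placeholders for illustration.
    """
    problems = []
    if cfg.get("num_generations_per_prompt", 2) != 2:
        problems.append("exactly two generations per prompt are required")
    if cfg.get("max_rollout_turns", 1) != 1:
        problems.append("only one rollout turn is supported")
    if not cfg.get("colocated_generation", True):
        problems.append("generation must be colocated")
    if cfg.get("dynamic_batching", False):
        problems.append("dynamic batching is not supported")
    if cfg.get("sequence_packing", False):
        problems.append("sequence packing is not supported")
    if problems:
        # Report all violations at once so one setup run surfaces them all.
        raise ValueError("Unsupported Online DPO settings: " + "; ".join(problems))
```

Collecting all violations before raising means a misconfigured run fails once with the full list, rather than once per unsupported option.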

Validation

Passed locally:

  • uvx ruff check nemo_rl/algorithms/online_dpo.py examples/run_online_dpo.py tests/unit/algorithms/test_online_dpo.py
  • uvx ruff format --check nemo_rl/algorithms/online_dpo.py examples/run_online_dpo.py tests/unit/algorithms/test_online_dpo.py
  • bash -n tests/functional/online_dpo.sh
  • bash -n tests/functional/L1_Functional_Tests_GPU.sh
  • git diff --cached --check
  • /usr/local/bin/python3.13 -m py_compile nemo_rl/algorithms/online_dpo.py examples/run_online_dpo.py tests/unit/algorithms/test_online_dpo.py
  • YAML parse check for examples/configs/online_dpo.yaml

The focused pytest run could not be executed locally, for three reasons: the default /usr/local/bin/python3.13 is not inspectable by uv on this machine (platform.mac_ver() returns empty); the available uv-managed Python is 3.13.12, while this repo requires >=3.13.13; and the direct 3.13.13 interpreter lacks project dependencies such as torch and pytest.

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Documentation and community-request labels May 10, 2026
@taivu1998 taivu1998 marked this pull request as ready for review May 11, 2026 03:06
@taivu1998 taivu1998 requested review from a team as code owners May 11, 2026 03:06
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers label May 12, 2026

Labels

community-request, Documentation, waiting-on-maintainers

Development

Successfully merging this pull request may close these issues.

Online DPO Support

2 participants