feat: add online DPO training by taivu1998 · Pull Request #2456 · NVIDIA-NeMo/RL

taivu1998 · 2026-05-10T10:53:43Z

Summary

Adds a first Online DPO training path that samples two policy responses per prompt, scores them through NeMo RL environments, converts the higher/lower reward responses into chosen/rejected preference pairs, and trains with the existing DPO loss.
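The pair-construction step described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in nemo_rl/algorithms/online_dpo.py: the names PreferencePair and build_preference_pair are hypothetical, and skipping prompts whose two rewards tie is an assumption of this sketch, since the description does not say how ties are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    """One chosen/rejected pair for a single prompt (hypothetical container)."""

    prompt: str
    chosen: str
    rejected: str
    reward_margin: float


def build_preference_pair(
    prompt: str,
    responses: tuple[str, str],
    rewards: tuple[float, float],
) -> Optional[PreferencePair]:
    """Turn two scored responses for one prompt into a chosen/rejected pair.

    Assumption of this sketch: a reward tie carries no preference signal,
    so the prompt is skipped and None is returned.
    """
    resp_a, resp_b = responses
    reward_a, reward_b = rewards
    if reward_a == reward_b:
        return None
    if reward_a > reward_b:
        return PreferencePair(prompt, resp_a, resp_b, reward_a - reward_b)
    return PreferencePair(prompt, resp_b, resp_a, reward_b - reward_a)


# The higher-reward response becomes "chosen", the other "rejected".
pair = build_preference_pair("What is 2 + 2?", ("5", "4"), (0.0, 1.0))
tie = build_preference_pair("What is 2 + 2?", ("4", "four"), (1.0, 1.0))
```

Here pair.chosen is the response with reward 1.0, and tie is None because neither response is preferred.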

Closes #1816.

What Changed

Added examples/run_online_dpo.py as the Online DPO entrypoint.
Added nemo_rl/algorithms/online_dpo.py with rollout collection, online preference-pair construction, reference logprob calculation, DPO training, validation, checkpointing, and resume state.
Added examples/configs/online_dpo.yaml as the exemplar config and source of defaults.
Added Online DPO documentation and linked it from the main docs index and algorithm matrix.
Added unit coverage for pair construction, environment-observation stripping, DPO collation, and reference-logprob rolling.
Added a functional smoke script and wired it into the fast L1 GPU suite.
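
For orientation, the reference logprobs feed the DPO objective along the following lines. This is a minimal sketch of the standard per-pair DPO loss, -log(sigmoid(beta * margin)), written in plain Python; it is not the repo's existing DPO loss, which may add other terms or weights. The function name dpo_pair_loss and the default beta of 0.1 are assumptions for illustration.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities (sums of token logprobs)
    under the trained policy and the frozen reference model.
    """
    # How much more the policy favors "chosen" over "rejected" than the
    # reference model does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy matches the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen logprob relative to the reference lowers the loss.
loss_after_update = dpo_pair_loss(-10.0, -15.0, -12.0, -15.0)
```

The reference logprobs only need to be computed once per pair, since the reference model is not updated while the policy trains.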

Initial Scope

The implementation is intentionally narrow to keep the first version predictable:

text-only LLM prompts
one rollout turn
exactly two generations per prompt
scalar environment rewards
single prompt dataloader
colocated generation
no NeMo-Gym rollouts
no dynamic batching or sequence packing

Unsupported surfaces are rejected during setup instead of being silently accepted.
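
A fail-fast setup check of this kind might look like the sketch below. Every config key here (num_generations_per_prompt, max_rollout_turns, colocated_generation, dynamic_batching, sequence_packing) is a hypothetical name chosen for illustration and may not match examples/configs/online_dpo.yaml; the function name validate_online_dpo_scope is likewise an assumption.

```python
def validate_online_dpo_scope(cfg: dict) -> None:
    """Raise ValueError listing every setting outside the initial scope.

    All keys are hypothetical; missing keys fall back to in-scope defaults.
    """
    problems = []
    if cfg.get("num_generations_per_prompt", 2) != 2:
        problems.append("exactly two generations per prompt are required")
    if cfg.get("max_rollout_turns", 1) != 1:
        problems.append("only one rollout turn is supported")
    if not cfg.get("colocated_generation", True):
        problems.append("only colocated generation is supported")
    if cfg.get("dynamic_batching", False):
        problems.append("dynamic batching is not supported")
    if cfg.get("sequence_packing", False):
        problems.append("sequence packing is not supported")
    if problems:
        # Report all violations at once rather than one per run.
        raise ValueError("Unsupported Online DPO config: " + "; ".join(problems))


# An in-scope config passes silently.
validate_online_dpo_scope({"num_generations_per_prompt": 2, "max_rollout_turns": 1})

# An out-of-scope config is rejected with every violation listed.
try:
    validate_online_dpo_scope(
        {"num_generations_per_prompt": 4, "sequence_packing": True}
    )
    rejection_message = None
except ValueError as err:
    rejection_message = str(err)
```

Collecting all violations before raising means a user fixing a config sees the full list in one pass.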

Validation

Passed locally:

uvx ruff check nemo_rl/algorithms/online_dpo.py examples/run_online_dpo.py tests/unit/algorithms/test_online_dpo.py
uvx ruff format --check nemo_rl/algorithms/online_dpo.py examples/run_online_dpo.py tests/unit/algorithms/test_online_dpo.py
bash -n tests/functional/online_dpo.sh
bash -n tests/functional/L1_Functional_Tests_GPU.sh
git diff --cached --check
/usr/local/bin/python3.13 -m py_compile nemo_rl/algorithms/online_dpo.py examples/run_online_dpo.py tests/unit/algorithms/test_online_dpo.py
YAML parse check for examples/configs/online_dpo.yaml

Could not run the focused pytest locally, for three reasons: the default /usr/local/bin/python3.13 is not inspectable by uv on this machine (platform.mac_ver() returns empty); the available uv-managed Python is 3.13.12, while this repo requires >=3.13.13; and the direct 3.13.13 interpreter lacks project dependencies such as torch and pytest.

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>

copy-pr-bot · 2026-05-10T10:53:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

feat: add online DPO training

d47f7ce

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>

github-actions bot added the Documentation and community-request labels May 10, 2026

taivu1998 marked this pull request as ready for review May 11, 2026 03:06

taivu1998 requested review from a team as code owners May 11, 2026 03:06

svcnvidia-nemo-ci added the waiting-on-maintainers label May 12, 2026

feat: add online DPO training #2456
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-1816-online-dpo
