Implement Tunix-based DPO/ORPO integration #3668
Conversation
Force-pushed from 96406ac to 9e02c1d
Force-pushed from 9e02c1d to 0d115b4
```python
return jnp.sum(batch["targets_segmentation"] != 0)


class DPOTrainingHooks(SFTTrainingHooks):
```
Reviewer: Please move this class to `src/maxtext/trainers/post_train/dpo/hooks.py`.
```python
class DPOTrainingHooks(SFTTrainingHooks):
  """Training hooks for DPO."""
```
Reviewer: Also, we could move the common functionality from SFTTrainingHooks and SFTDataHooks to maxtext/trainers/post_train/hooks.py and create child classes:
- SFTTrainingHooks and SFTDataHooks in post_train/sft that inherit from them
- DPOTrainingHooks and DPODataHooks in post_train/dpo that inherit from them

This way, we could reuse the hook implementations for RL hooks as well in the future.
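A minimal sketch of the suggested hierarchy (only the module paths and SFT/DPO class names come from the comment; the base class and its method names are hypothetical):

```python
# maxtext/trainers/post_train/hooks.py (proposed shared base module)
import abc


class PostTrainingHooks(abc.ABC):
  """Training-lifecycle callbacks shared by all post-training recipes."""

  def on_train_start(self, state):  # hypothetical hook point
    pass

  def on_train_step_end(self, state, metrics):  # hypothetical hook point
    pass


# maxtext/trainers/post_train/sft/hooks.py
class SFTTrainingHooks(PostTrainingHooks):
  """SFT-specific overrides of the shared hooks."""


# maxtext/trainers/post_train/dpo/hooks.py
class DPOTrainingHooks(PostTrainingHooks):
  """DPO-specific overrides, e.g. logging chosen/rejected metrics."""
```

Under this layout, a future RL recipe would add its own child classes in post_train/rl the same way.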
```yaml
use_dpo: true
train_data_columns: ['chosen', 'rejected']
eval_data_columns: ['chosen', 'rejected']
use_orpo: false
```
Reviewer: Instead of use_orpo, can we have a nested config like:

```yaml
dpo:
  algo: dpo   # or orpo
  dpo_beta: ...
  orpo_lambda: ...
  ...
```
Author: This is a great idea, but I would rather postpone it until a later PR. First I want to clean up all the legacy DPO code, which will come as a follow-up PR.
Reviewer: This file is growing. Let's keep the DPO-specific changes in a new module, input_pipeline/dpo_utils.py.
```python
def map(self, element):
  """Maps original DPO columns to Tunix-compatible pre-tokenized format."""
  # Handle the 'input' -> 'prompt_ids' mapping
  prompt_ids = self._pad(element.pop("input"), self.max_prompt_length, left=True)
```
Reviewer: Please raise a KeyError when a required key is not found in element.
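A sketch of that check, assuming the column names from the quoted diff (the helper name is hypothetical):

```python
REQUIRED_COLUMNS = ("input", "chosen", "rejected")


def check_required_columns(element: dict) -> None:
  """Raises a KeyError that names the missing DPO column and lists what was
  found, instead of relying on dict.pop's terse default message."""
  for key in REQUIRED_COLUMNS:
    if key not in element:
      raise KeyError(
          f"DPO example is missing required column {key!r}; "
          f"available columns: {sorted(element)}")
```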
```python
# Handle the 'input' -> 'prompt_ids' mapping
prompt_ids = self._pad(element.pop("input"), self.max_prompt_length, left=True)
chosen_ids = self._pad(element.pop("chosen"), self.max_response_length, left=False)
rejected_ids = self._pad(element.pop("rejected"), self.max_response_length, left=False)
```
Reviewer: Can we avoid hardcoding keys like chosen and rejected?
Author: Yes, in fact I had a follow-up PR for that, which also allows datasets with only two columns, "chosen" and "rejected", where the "prompt" is inferred automatically from the common prefix of "chosen" and "rejected". I cherry-picked that PR into this one.
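A minimal sketch of that inference on token-id sequences (hypothetical helper; the cherry-picked implementation may differ):

```python
def split_common_prefix(chosen: list[int], rejected: list[int]):
  """Treats the longest common token prefix of the chosen and rejected
  sequences as the prompt; returns (prompt, chosen_resp, rejected_resp)."""
  n = 0
  limit = min(len(chosen), len(rejected))
  while n < limit and chosen[n] == rejected[n]:
    n += 1
  return chosen[:n], chosen[n:], rejected[n:]


# e.g. split_common_prefix([1, 2, 3, 4], [1, 2, 9]) -> ([1, 2], [3, 4], [9])
```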
```python
  return element


class DPOTunixPrep(grain.MapTransform):
```
Reviewer: Use @dataclasses.dataclass to simplify the initialization.
Author: Done (moved to a new file now).
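For reference, the dataclass form might look roughly like this (field names are guesses from the quoted diff):

```python
import dataclasses

import grain.python as grain


@dataclasses.dataclass
class DPOTunixPrep(grain.MapTransform):
  """Maps DPO columns to the Tunix pre-tokenized format.

  The @dataclasses.dataclass decorator generates __init__ from the fields,
  removing hand-written constructor boilerplate.
  """
  max_prompt_length: int
  max_response_length: int
  pad_id: int = 0

  def map(self, element):
    ...  # padding / renaming as in the quoted diff
```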
Reviewer: Can we also add a Jupyter notebook for DPO/ORPO that runs on GitHub CI to validate the implementation?
Author: I just added the notebook (vibe-coded), but I don't want to add it to GitHub CI yet. I still have a few follow-up PRs to clean things up.
Force-pushed from aca4891 to 3351239
Force-pushed from 67f3373 to 1544c1b
Force-pushed from 0f03a83 to b8b2171
Force-pushed from 714d6e9 to db68194
Force-pushed from db68194 to 46a9ab3
…I validation loop
Description
Tunix-based DPO/ORPO implementation.
This DPO implementation is based on train_sft.py. The hooks are shared with SFT (see #3862). Only HF datasets are currently supported; there is no Grain or TFDS support yet.
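For context, here is a schematic of the two objectives in JAX (standard formulations from the DPO and ORPO papers, not the code added in this PR; per-sequence log-probabilities are assumed as inputs):

```python
import jax.numpy as jnp
from jax.nn import log_sigmoid


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
  """DPO: push the policy's chosen-vs-rejected log-ratio above the
  frozen reference model's."""
  margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
  return -jnp.mean(log_sigmoid(beta * margin))


def orpo_loss(nll_chosen, logp_chosen, logp_rejected, lam=0.5):
  """ORPO: SFT loss on the chosen response plus an odds-ratio penalty;
  no reference model is required."""
  def log_odds(logp):
    return logp - jnp.log1p(-jnp.exp(logp))  # log(p / (1 - p)) from log p
  ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
  return jnp.mean(nll_chosen) - lam * jnp.mean(log_sigmoid(ratio))
```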
After this PR I plan to follow up with:
- cleaning up the legacy DPO code
- restructuring the DPO/ORPO options into a nested config block
FIXES: b/485626968
Tests
Ran DPO training on qwen2.5-1.5b. Performed qualitative validation by decoding responses for targeted conceptual prompts, e.g. "What is DPO (Direct Preference Optimization)?".

Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [ ] Add the gemini-review label.