
Added SFT Pre-Processing for Grain Input Pipeline #3437

Open
ajkv-google wants to merge 8 commits into main from ajkv/sft-grain-implementation

Conversation

ajkv-google (Collaborator) commented Mar 18, 2026

Description

This PR introduces SFT support to the Grain input pipeline by adding a separate sft_preprocessing_pipeline function. Rather than cluttering the existing pretrain code, it uses simple conditionals inside the train and eval iterators to route to this new SFT logic. I followed the existing Hugging Face SFT implementation and adapted its logic to be compatible with Grain's element-wise datasets.
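As a rough illustration of the routing described above (all names, signatures, and the `Config` class here are placeholders for this sketch, not the actual MaxText API), the conditional dispatch inside the iterators might look like:

```python
from dataclasses import dataclass


@dataclass
class Config:
  # Illustrative config flag; the real MaxText config is richer than this.
  use_sft: bool = False


def sft_preprocessing_pipeline(dataset, config):
  # Stand-in for the real SFT path: chat-template formatting, tokenization,
  # packing/padding, and batching would happen here.
  return [("sft", x) for x in dataset]


def pretrain_preprocessing_pipeline(dataset, config):
  # Stand-in for the existing pretrain path, left untouched by the SFT work.
  return [("pretrain", x) for x in dataset]


def make_train_iterator(dataset, config):
  # A simple conditional routes SFT data to the dedicated pipeline instead of
  # interleaving SFT logic with the pretrain code.
  if config.use_sft:
    return iter(sft_preprocessing_pipeline(dataset, config))
  return iter(pretrain_preprocessing_pipeline(dataset, config))
```

Keeping the two pipelines as separate functions behind one conditional keeps the pretrain path unchanged while letting both share the same iterator entry point.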

Tests

I added a unit test to verify end-to-end that the Grain SFT pipeline formats and outputs the data correctly. Ran this command to execute the unit test:

  • pytest tests/unit/grain_data_processing_test.py::GrainSFTParquetProcessingTest -v

This is the output of the test: Test Passed Output

Also ran the training pipeline in MaxText with SFT enabled using a Grain dataset with this command:

  • python3 -m maxtext.trainers.post_train.sft.train_sft src/maxtext/configs/post_train/sft.yml run_name=test_grain_sft dataset_type=grain grain_file_type=parquet grain_train_files=gs://maxtext-dataset/hf/ultrachat_200k/train_sft-*.parquet steps=10 tokenizer_type=huggingface tokenizer_path=HuggingFaceH4/zephyr-7b-beta

Verified that the SFT processing changes worked and the model trained successfully: Logs

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 40.57971% with 41 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...rc/maxtext/input_pipeline/grain_data_processing.py | 40.57% | 33 Missing and 8 partials ⚠️


vlad-karp (Collaborator) left a comment:

It would also be great to test not only with the MaxText general SFT pipeline but with the distillation SFT pipeline as well.

messages = [{"role": "user", "content": element["prompt"]}, {"role": "assistant", "content": element["completion"]}]
elif set(data_columns) == {"question", "answer"}:
messages = [{"role": "user", "content": element["question"]}, {"role": "assistant", "content": element["answer"]}]
else:
Collaborator:

The HF pipeline asserts that SFT is running on a conversational format.

dataset = dataset.batch(batch_size, batch_fn=batch_fn)

# Shift inputs for teacher-forced training
dataset = dataset.map(
Collaborator:

Should it always be executed in a generic sft_preprocessing_pipeline()?

Collaborator (Author):

I extracted the shifting logic into a generic shift_dataset() helper in data_processing_utils.py and applied it uniformly across both the SFT and Pretrain pipelines.
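For illustration, a generic shift helper along these lines (a hedged sketch, not the actual shift_dataset() code from the PR; the function name, dict keys, and trimming behavior are assumptions) could look like:

```python
import numpy as np


def shift_example(example):
  """Illustrative per-element shift for teacher-forced training.

  One common formulation: inputs drop the final token and targets drop the
  first, so targets[t] is the token that immediately follows inputs[t].
  The real MaxText helper may instead keep sequence length fixed and pad.
  """
  tokens = np.asarray(example["tokens"])
  return {
      "inputs": tokens[:-1],   # what the model reads
      "targets": tokens[1:],   # what the model is trained to predict
  }
```

Extracting this into a shared utility lets both the SFT and pretrain pipelines apply the identical shift via a single `dataset.map(shift_example)`-style call.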

Collaborator:

My concern was that the comment suggests it is distillation-only logic, but it is applied always.
What is the meaning of that shift operation? Should it only be applied in the distillation pipeline?

Collaborator (Author):

I see, yeah, that comment is a bit misleading. I just meant standard next-token prediction, not distillation. From my understanding, the 1-token shift is required for all autoregressive training, such as pretraining and SFT, to align inputs and targets, and it can be applied in distillation as well.
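To make the input/target alignment concrete, here is a tiny worked example with made-up token ids (not real tokenizer output):

```python
# The 1-token shift for next-token prediction: inputs and targets are the
# same sequence offset by one, so the label at position t is the token that
# follows inputs[t].
tokens = [101, 2009, 4937, 3323]  # illustrative ids for a short sentence
inputs = tokens[:-1]    # model reads:    [101, 2009, 4937]
targets = tokens[1:]    # model predicts: [2009, 4937, 3323]
# This alignment is what any autoregressive objective needs, whether the loss
# signal comes from pretraining text, SFT labels, or a distillation teacher.
```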

), f"Dataset column names mismatch. Expected columns to match one of {supported_columns}, but got {data_columns}"

dataset = dataset.map(
functools.partial(_format_chat_template_grain, data_columns=data_columns, tokenizer_model=tokenizer_model)
Collaborator:

The hf pipeline calls instruction_data_processing.convert_to_conversational_format, do we support the same conversion here? https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/input_pipeline/hf_data_processing.py#L261

Collaborator (Author):

Yes, we do support the conversion. Because Grain's .map() processes row-by-row (unlike HF dataset operations), I implemented the conversion inside _format_chat_template_grain() above this line. It handles both prompt/completion and question/answer pairs.
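For reference, the conversion could be sketched roughly like this (a simplified stand-in, not the actual _format_chat_template_grain(); the function name below is hypothetical and applying the tokenizer's chat template is omitted):

```python
def to_conversational_format(element, data_columns):
  # Map a flat dataset row to the conversational message format, handling
  # both of the column pairs mentioned above. Illustrative sketch only.
  cols = set(data_columns)
  if cols == {"prompt", "completion"}:
    return [
        {"role": "user", "content": element["prompt"]},
        {"role": "assistant", "content": element["completion"]},
    ]
  if cols == {"question", "answer"}:
    return [
        {"role": "user", "content": element["question"]},
        {"role": "assistant", "content": element["answer"]},
    ]
  # Mirrors the assertion on supported column names in the diff.
  raise ValueError(f"Unsupported columns: {sorted(cols)}")
```

Because Grain's `.map()` sees one element at a time, this per-row function is the natural place for the conversion that the HF pipeline does as a whole-dataset operation.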

elif tokenizer_model.unk_id is not None:
pad_id = tokenizer_model.unk_id
else:
pad_id = -1
Collaborator:

I think 0 as the default is better.

return batch_size


def pack_or_pad_and_batch_dataset(dataset, config, batch_size, pad_id, data_columns, tokenizer_model):
Collaborator:

Could this have a simpler name?
