
Conversation

@asolergi-nv

Description

In this PR I'm adding the MegatronTokenizerWriter, which tokenizes documents and produces the .bin and .idx files required for training with Megatron-LM and its data-loading solution.

  • The .bin file contains the tokenized documents. We use 4 bytes per token if the vocabulary size is greater than 2**16; otherwise, we use 2 bytes per token (see the sketch after this list).
  • The .idx file contains metadata about the .bin file, mainly the number of tokenized documents and their lengths. More details can be found in the close method of MegatronTokenizerWriter.
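
To make the byte-width rule concrete, here is a minimal sketch (not the writer's actual code; token_dtype and append_document are hypothetical helpers), assuming NumPy:

import numpy as np

def token_dtype(vocab_size: int) -> np.dtype:
    # 2 bytes per token when every token id fits in 16 bits, else 4 bytes
    return np.dtype(np.uint16) if vocab_size <= 2**16 else np.dtype(np.int32)

def append_document(bin_file, token_ids: list[int], dtype: np.dtype) -> int:
    # Append one tokenized document to the .bin file and return its length,
    # which is the kind of per-document metadata the .idx file records
    arr = np.asarray(token_ids, dtype=dtype)
    bin_file.write(arr.tobytes())
    return len(arr)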

At first, I tried creating a CompositeStage using TokenizerStage, but as we already discussed, TokenizerStage caused OOM issues. To address this, I added a batch_size argument that controls how many documents we tokenize at once, write to disk, and then immediately discard.
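
The batching idea, sketched here with hypothetical stand-ins for the tokenizer and writer (the real stage only exposes the batch_size argument):

def write_in_batches(documents: list[str], tokenizer, writer, batch_size: int) -> None:
    for start in range(0, len(documents), batch_size):
        batch = documents[start : start + batch_size]
        token_ids = tokenizer(batch)["input_ids"]  # tokenize only this batch
        for ids in token_ids:
            writer.append_document(ids)  # write to disk right away
        del token_ids  # discard the batch before tokenizing the next one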

I’ve also included the tokenizer-test folder, which contains the test.sh script I used to verify that the produced files match those created by Megatron’s preprocess_data.py script. To run the checks, you only need to set the DATA_ROOT folder in the script and execute it; it will clone Megatron, start the Ray server, download and convert the TinyStories dataset to JSONL, tokenize the dataset with 8 different configurations, and finally confirm that the files generated by Curator and the Megatron script match.

We perform this validation with 4 different tokenizers, include in the dataset one sample containing all tokenizer-specific special tokens, and toggle the append_eod config (also present in the Megatron script). Of these 4 tokenizers, GPT-2 uses 2 bytes per token, since its vocabulary (50,257 tokens) is ≤ 2**16.

I’m now writing some unit tests, similar to the ones in tests/stages/text/io/writer/test_jsonl.py. I’d like to know whether…

  • I should write any specific documentation
  • I should include any tutorial. Perhaps I could add an option to use the MegatronTokenizerWriter in tutorials/text/tinystories/main.py — what do you think?

Let me know your thoughts!

Usage

# Note: imports omitted; JsonlReader, MegatronTokenizerWriter, and Pipeline
# are the NeMo Curator classes used/introduced in this PR.

# Define the processing stages
stages = [
    # Read the data from the JSONL files
    JsonlReader(
        file_paths="data/raw",
        fields="text",
    ),
    # Tokenize the data and write the .bin/.idx files
    MegatronTokenizerWriter(
        path="data/tokenized-dataset",
        model_identifier="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
        append_eod=True,
        text_field="text",
    ),
]

# Create a pipeline with the stages
pipeline = Pipeline(
    name="megatron-tokenize",
    description="Tokenize dataset for Megatron-LM.",
    stages=stages,
)
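
For completeness, executing the pipeline might look like the following; this is a sketch, assuming Pipeline exposes the same run() entry point used in other Curator examples:

# Run the pipeline (run() is an assumption based on other Curator examples)
pipeline.run()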

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
@greptile-apps greptile-apps bot left a comment

9 files reviewed, no comments


@ayushdg ayushdg left a comment


Minor comment about cloud testing but overall looks great! Thanks a lot @asolergi-nv

help="Path to folder containing Parquet files",
)
group.add_argument(
"--output-path",

Is it possible to test if this works when output_path is a cloud path like s3? Things should work but we've been bitten by not testing cloud IO better for other tutorials in the past and are in the process of updating it for this release.
