
Conversation

@asolergi-nv

Description

In this PR I'm adding the MegatronTokenizerWriter, which tokenizes documents and produces the .bin and .idx files required for training with Megatron-LM and its data-loading solution.

  • The .bin file contains the tokenized documents. We use 4 bytes per token if the vocabulary size is greater than 2**16; otherwise, we use 2 bytes per token (see the sketch after this list).
  • The .idx file contains metadata about the .bin file, mainly the number of tokenized documents and their lengths. More details can be found in the close method of MegatronTokenizerWriter.
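
To make the byte-width rule concrete, here is a minimal sketch (not the writer's actual code; token_dtype and append_document are hypothetical helpers), assuming NumPy:

import numpy as np

def token_dtype(vocab_size: int) -> np.dtype:
    # 2 bytes per token when every token id fits in 16 bits, else 4 bytes
    return np.dtype(np.uint16) if vocab_size <= 2**16 else np.dtype(np.int32)

def append_document(bin_file, token_ids: list[int], dtype: np.dtype) -> int:
    # Append one tokenized document to the .bin file and return its length,
    # which is the kind of per-document metadata the .idx file records
    arr = np.asarray(token_ids, dtype=dtype)
    bin_file.write(arr.tobytes())
    return len(arr)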

At first, I tried creating a CompositeStage using TokenizerStage, but as we already discussed, TokenizerStage caused OOM issues. To address this, I added a batch_size argument that controls how many documents we tokenize at once, write to disk, and then immediately discard.
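
The batching idea, sketched here with hypothetical stand-ins for the tokenizer and writer (the real stage only exposes the batch_size argument):

def write_in_batches(documents: list[str], tokenizer, writer, batch_size: int) -> None:
    for start in range(0, len(documents), batch_size):
        batch = documents[start : start + batch_size]
        token_ids = tokenizer(batch)["input_ids"]  # tokenize only this batch
        for ids in token_ids:
            writer.append_document(ids)  # write to disk right away
        del token_ids  # discard the batch before tokenizing the next one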

I’ve also included the tokenizer-test folder, which contains the test.sh script I used to verify that the produced files match those created by Megatron’s preprocess_data.py script. To run the checks, you only need to set the DATA_ROOT folder in the script and execute it; it will clone Megatron, start the Ray server, download and convert the TinyStories dataset to JSONL, tokenize the dataset with 8 different configurations, and finally confirm that the files generated by Curator and the Megatron script match.

We perform this validation with 4 different tokenizers, include in the dataset one sample containing all tokenizer-specific special tokens, and toggle the append_eod config (also present in the Megatron script). Of these 4 tokenizers, GPT-2 uses 2 bytes per token, since its vocabulary (50,257 tokens) is ≤ 2**16.

I’m now writing some unit tests, similar to the ones in tests/stages/text/io/writer/test_jsonl.py. I’d like to know whether…

  • I should write any specific documentation
  • I should include any tutorial. Perhaps I could add an option to use the MegatronTokenizerWriter in tutorials/text/tinystories/main.py — what do you think?

Let me know your thoughts!

Usage

# Note: imports omitted; JsonlReader, MegatronTokenizerWriter, and Pipeline
# are the NeMo Curator classes used/introduced in this PR.

# Define the processing stages
stages = [
    # Read the data from the JSONL files
    JsonlReader(
        file_paths="data/raw",
        fields="text",
    ),
    # Tokenize the data and write the .bin/.idx files
    MegatronTokenizerWriter(
        path="data/tokenized-dataset",
        model_identifier="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
        append_eod=True,
        text_field="text",
    ),
]

# Create a pipeline with the stages
pipeline = Pipeline(
    name="megatron-tokenize",
    description="Tokenize dataset for Megatron-LM.",
    stages=stages,
)
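
For completeness, executing the pipeline might look like the following; this is a sketch, assuming Pipeline exposes the same run() entry point used in other Curator examples:

# Run the pipeline (run() is an assumption based on other Curator examples)
pipeline.run()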

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
@greptile-apps greptile-apps bot left a comment

9 files reviewed, no comments


@ayushdg ayushdg left a comment


Minor comment about cloud testing but overall looks great! Thanks a lot @asolergi-nv

help="Path to folder containing Parquet files",
)
group.add_argument(
"--output-path",

Is it possible to test if this works when output_path is a cloud path like s3? Things should work but we've been bitten by not testing cloud IO better for other tutorials in the past and are in the process of updating it for this release.
