[Tutorial] Merge multiple file prefixes generated by `MegatronTokenizerWriter` by asolergi-nv · Pull Request #1427 · NVIDIA-NeMo/Curator

asolergi-nv · 2026-01-26T11:19:12Z

Description

Addresses #1399. Adds the tutorials/text/megatron-tokenizer/merge_datasets.py script, which merges multiple file prefixes generated by MegatronTokenizerWriter into a single one. It is a simplified version of the tools/merge_datasets.py script from the Megatron-LM library. Tested with both vocabulary-size ranges (≤65535 and >65535).
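The two vocabulary ranges presumably matter because the width of each stored token ID depends on the vocabulary size. The following is a minimal sketch of that idea, not the writer's actual code: it assumes a 2-byte unsigned integer is used up to the 65535 threshold quoted above and a 4-byte signed integer beyond it. The helper names `token_struct_code` and `pack_tokens` are hypothetical and exist only for illustration.

```python
import struct

# Threshold taken from the PR description; treating it as the switch point is an assumption.
UINT16_MAX_VOCAB = 65535


def token_struct_code(vocab_size: int) -> str:
    """Return a struct format code wide enough for token IDs of this vocabulary."""
    # "H" is an unsigned 16-bit integer, "i" is a signed 32-bit integer.
    return "H" if vocab_size <= UINT16_MAX_VOCAB else "i"


def pack_tokens(token_ids: list[int], vocab_size: int) -> bytes:
    """Serialize token IDs as little-endian integers of the chosen width."""
    code = token_struct_code(vocab_size)
    return struct.pack(f"<{len(token_ids)}{code}", *token_ids)
```

With this sketch, the same three tokens occupy 6 bytes for a 50,000-entry vocabulary and 12 bytes for a 100,000-entry one, which is why prefixes produced with different widths cannot simply be concatenated.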

Usage

python tutorials/text/megatron-tokenizer/merge_datasets.py \
    --input-dir /path/to/tokenized/files \
    --output-prefix /path/to/output/merged

Checklist

I am familiar with the Contributing Guide.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

greptile-apps · 2026-01-26T11:21:57Z

Greptile Summary

This PR moves the Megatron-LM dataset merge logic into nemo_curator/utils/merge_file_prefixes.py (a library-importable module) and adds a thorough test suite. The previously raised P1 concerns — builder null-crash on empty input, incorrect path construction with string concatenation, and os.path.dirname returning "" for bare filenames — have all been addressed correctly in this version.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 issues are resolved and remaining findings are P2 style suggestions.

All critical bugs from earlier review rounds (builder None-crash, broken path join, empty dirname edge case) are correctly fixed. The three new findings are P2: one is a memory-efficiency suggestion (unused sequence_pointers array), one is a missing error message on dtype assertion, and one is a test import coupling. None affect correctness or runtime behaviour.

No files require special attention.
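The two path-related issues mentioned above are easy to reproduce with the standard library alone. The sketch below is illustrative and is not taken from the PR: the function names `output_paths` and `prefix_in_dir` are hypothetical, and the fallback to `"."` is one common way to handle a bare filename, not necessarily the fix the author chose.

```python
import os


def output_paths(output_prefix: str) -> tuple[str, str]:
    """Return (.bin, .idx) paths for a prefix, creating the parent directory if needed."""
    # os.path.dirname("merged") is "", and os.makedirs("") raises FileNotFoundError,
    # so fall back to the current directory for bare filenames.
    parent = os.path.dirname(output_prefix) or "."
    os.makedirs(parent, exist_ok=True)
    return output_prefix + ".bin", output_prefix + ".idx"


def prefix_in_dir(input_dir: str, name: str) -> str:
    """Join a directory and a file prefix without relying on a trailing slash."""
    # Plain concatenation would turn input_dir="data" and name="train" into "datatrain";
    # os.path.join inserts the separator only when it is missing.
    return os.path.join(input_dir, name)
```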

Important Files Changed

Filename	Overview
nemo_curator/utils/merge_file_prefixes.py	New utility module that merges Megatron-LM IndexedDataset file prefixes; previously flagged P1 issues (builder None-crash, bad path join, os.path.dirname edge case) are all resolved, leaving only minor style issues.
tests/utils/test_merge_file_prefixes.py	Comprehensive end-to-end test suite covering happy paths (parametrised over batch count, vocab size, EOD flag), single-prefix round-trip, empty directory, and orphan-pair error cases; _INDEX_HEADER is imported from the wrong module which creates a minor test-vs-impl coupling risk.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([merge_file_prefixes called]) --> B[os.listdir input_dir]
    B --> C{ext in .bin .idx?}
    C -- No --> B
    C -- Yes --> D{prefix already seen?}
    D -- Yes --> B
    D -- No --> E{isfile check}
    E -- No --> B
    E -- Yes --> F{paired ext exists?}
    F -- No --> G([AssertionError — missing pair])
    F -- Yes --> H[Add prefix to set]
    H --> B
    B --> I{prefixes empty?}
    I -- Yes --> J([ValueError — no valid pairs])
    I -- No --> K[Sort prefixes]
    K --> L[First prefix: extract dtype from .idx]
    L --> M[Create IndexedDatasetBuilder — open output .bin]
    M --> N[For each prefix in sorted order]
    N --> O[extract_index_contents — read .idx]
    O --> P{dtype matches?}
    P -- No --> Q([AssertionError — dtype mismatch])
    P -- Yes --> R[Extend sequence_lengths and document_indices]
    R --> S[shutil.copyfileobj — append .bin data]
    S --> N
    N --> T[builder.finalize — close .bin, write .idx via _IndexWriter]
    T --> U([Done — output_prefix.bin and .idx written])
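The discovery and `.bin` concatenation steps of the flowchart can be sketched with the standard library. This is a simplified illustration under stated assumptions, not the module's code: `discover_prefixes` and `concatenate_bins` are hypothetical names, and the index merge (sequence lengths, document indices, dtype check) is left out because it depends on the `.idx` binary layout.

```python
import os
import shutil


def discover_prefixes(input_dir: str) -> list[str]:
    """Return sorted file prefixes in input_dir that have both a .bin and a .idx file."""
    prefixes = set()
    for name in os.listdir(input_dir):
        stem, ext = os.path.splitext(name)
        if ext not in (".bin", ".idx"):
            continue
        prefix = os.path.join(input_dir, stem)
        if prefix in prefixes or not os.path.isfile(os.path.join(input_dir, name)):
            continue
        other = ".idx" if ext == ".bin" else ".bin"
        if not os.path.isfile(prefix + other):
            # The flowchart reports an AssertionError for an orphan file.
            raise AssertionError(f"Missing {other} file for prefix {prefix}")
        prefixes.add(prefix)
    if not prefixes:
        raise ValueError(f"No .bin/.idx pairs found in {input_dir}")
    return sorted(prefixes)


def concatenate_bins(prefixes: list[str], output_bin: str) -> None:
    """Append each prefix's .bin payload to output_bin in the given order."""
    with open(output_bin, "wb") as out:
        for prefix in prefixes:
            with open(prefix + ".bin", "rb") as src:
                # Streams in chunks, so large shards are not loaded into memory.
                shutil.copyfileobj(src, out)
```

Sorting the prefixes before merging keeps the output deterministic regardless of the order in which `os.listdir` returns entries.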

Reviews (8): Last reviewed commit: "Merge branch 'main' into merge_datasets"

greptile-apps

3 files reviewed, 3 comments

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

greptile-apps

5 files reviewed, 5 comments

sarahyurick

Thanks! I added some minor requests and left some comments on the greptile. Let me know what you think.

sarahyurick · 2026-01-26T18:10:33Z

@@ -0,0 +1,277 @@
+"""
+Simplified version of the tools/merge_datasets.py script from the Megatron-LM library.



sarahyurick · 2026-04-20T18:21:30Z

Hi @asolergi-nv can you add some pytests for this?


sarahyurick

LGTM thank you @asolergi-nv !

sarahyurick · 2026-04-22T16:59:21Z

/ok to test fe25bb4

Add merge datasets tutorial

c348dda

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

greptile-apps Bot reviewed Jan 26, 2026


Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated

Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated

Comment thread nemo_curator/utils/merge_file_prefixes.py

Update tutorials/text/megatron-tokenizer/merge_datasets.py

3308c6e

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

greptile-apps Bot reviewed Jan 26, 2026


sarahyurick reviewed Jan 26, 2026


sarahyurick mentioned this pull request Apr 13, 2026

Nemotron-CLIMB data curation recipe #1727

Open

5 tasks

Merge branch 'main' into merge_datasets

203b0de

asolergi-nv requested a review from a team as a code owner April 13, 2026 21:17

asolergi-nv requested review from praateekmahajan and removed request for a team April 13, 2026 21:17

copy-pr-bot Bot temporarily deployed to test April 13, 2026 21:18 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 13, 2026 21:18 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 13, 2026 21:27 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 16, 2026 19:16 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 16, 2026 19:28 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci April 16, 2026 19:28 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci April 16, 2026 19:28 Inactive

Review comments, move and rename merge file prefixes script

224825a

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

copy-pr-bot Bot temporarily deployed to test April 16, 2026 21:28 Inactive

Merge branch 'main' into merge_datasets

a83140d

Add tests

9429733

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

sarahyurick approved these changes Apr 22, 2026


Merge branch 'main' into merge_datasets

fe25bb4

