Skip to content

[Tutorial] Merge multiple file prefixes generated by MegatronTokenizerWriter#1427

Merged
sarahyurick merged 9 commits intoNVIDIA-NeMo:mainfrom
asolergi-nv:merge_datasets
Apr 22, 2026
Merged

[Tutorial] Merge multiple file prefixes generated by MegatronTokenizerWriter#1427
sarahyurick merged 9 commits intoNVIDIA-NeMo:mainfrom
asolergi-nv:merge_datasets

Conversation

@asolergi-nv
Copy link
Copy Markdown
Contributor

Description

Addresses #1399. Adds tutorials/text/megatron-tokenizer/merge_datasets.py script to merge multiple file prefixes generated by MegatronTokenizerWriter into a single one . It is a simplified version of tools/merge_datasets.py script from the Megatron-LM library. Tested with the 2 different vocab sizes (≤65535, >65535).

Usage

python tutorials/text/megatron-tokenizer/merge_datasets.py \
    --input-dir /path/to/tokenized/files \
    --output-prefix /path/to/output/merged

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 26, 2026

Greptile Summary

This PR moves the Megatron-LM dataset merge logic into nemo_curator/utils/merge_file_prefixes.py (a library-importable module) and adds a thorough test suite. The previously raised P1 concerns — builder null-crash on empty input, incorrect path construction with string concatenation, and os.path.dirname returning "" for bare filenames — have all been addressed correctly in this version.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 issues are resolved and remaining findings are P2 style suggestions.

All critical bugs from earlier review rounds (builder None-crash, broken path join, empty dirname edge case) are correctly fixed. The three new findings are P2: one is a memory-efficiency suggestion (unused sequence_pointers array), one is a missing error message on dtype assertion, and one is a test import coupling. None affect correctness or runtime behaviour.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/utils/merge_file_prefixes.py New utility module that merges Megatron-LM IndexedDataset file prefixes; previously flagged P1 issues (builder None-crash, bad path join, os.path.dirname edge case) are all resolved, leaving only minor style issues.
tests/utils/test_merge_file_prefixes.py Comprehensive end-to-end test suite covering happy paths (parametrised over batch count, vocab size, EOD flag), single-prefix round-trip, empty directory, and orphan-pair error cases; _INDEX_HEADER is imported from the wrong module which creates a minor test-vs-impl coupling risk.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([merge_file_prefixes called]) --> B[os.listdir input_dir]
    B --> C{ext in .bin .idx?}
    C -- No --> B
    C -- Yes --> D{prefix already seen?}
    D -- Yes --> B
    D -- No --> E{isfile check}
    E -- No --> B
    E -- Yes --> F{paired ext exists?}
    F -- No --> G([AssertionError — missing pair])
    F -- Yes --> H[Add prefix to set]
    H --> B
    B --> I{prefixes empty?}
    I -- Yes --> J([ValueError — no valid pairs])
    I -- No --> K[Sort prefixes]
    K --> L[First prefix: extract dtype from .idx]
    L --> M[Create IndexedDatasetBuilder — open output .bin]
    M --> N[For each prefix in sorted order]
    N --> O[extract_index_contents — read .idx]
    O --> P{dtype matches?}
    P -- No --> Q([AssertionError — dtype mismatch])
    P -- Yes --> R[Extend sequence_lengths and document_indices]
    R --> S[shutil.copyfileobj — append .bin data]
    S --> N
    N --> T[builder.finalize — close .bin, write .idx via _IndexWriter]
    T --> U([Done — output_prefix.bin and .idx written])
Loading

Reviews (8): Last reviewed commit: "Merge branch 'main' into merge_datasets" | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated
Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated
Comment thread nemo_curator/utils/merge_file_prefixes.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment thread nemo_curator/utils/merge_file_prefixes.py
Comment thread nemo_curator/utils/merge_file_prefixes.py
Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated
Comment thread nemo_curator/utils/merge_file_prefixes.py
Comment thread nemo_curator/utils/merge_file_prefixes.py
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I added some minor requests and left some comments on the greptile. Let me know what you think.

Comment thread tutorials/text/megatron-tokenizer/README.md Outdated
Comment thread nemo_curator/utils/merge_file_prefixes.py
@@ -0,0 +1,277 @@
"""
Simplified version of the tools/merge_datasets.py script from the Megatron-LM library.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link?

Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated
Comment thread tutorials/text/megatron-tokenizer/merge_datasets.py Outdated
Comment thread nemo_curator/utils/merge_file_prefixes.py
@asolergi-nv asolergi-nv requested a review from a team as a code owner April 13, 2026 21:17
@asolergi-nv asolergi-nv requested review from praateekmahajan and removed request for a team April 13, 2026 21:17
Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
@sarahyurick
Copy link
Copy Markdown
Contributor

Hi @asolergi-nv can you add some pytests for this?

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you @asolergi-nv !

@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test fe25bb4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants