[Tutorial] Merge multiple file prefixes generated by MegatronTokenizerWriter#1427
Conversation
Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
Greptile SummaryThis PR moves the Megatron-LM dataset merge logic into Confidence Score: 5/5Safe to merge — all previously flagged P1 issues are resolved and remaining findings are P2 style suggestions. All critical bugs from earlier review rounds (builder None-crash, broken path join, empty dirname edge case) are correctly fixed. The three new findings are P2: one is a memory-efficiency suggestion (unused sequence_pointers array), one is a missing error message on dtype assertion, and one is a test import coupling. None affect correctness or runtime behaviour. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([merge_file_prefixes called]) --> B[os.listdir input_dir]
B --> C{ext in .bin .idx?}
C -- No --> B
C -- Yes --> D{prefix already seen?}
D -- Yes --> B
D -- No --> E{isfile check}
E -- No --> B
E -- Yes --> F{paired ext exists?}
F -- No --> G([AssertionError — missing pair])
F -- Yes --> H[Add prefix to set]
H --> B
B --> I{prefixes empty?}
I -- Yes --> J([ValueError — no valid pairs])
I -- No --> K[Sort prefixes]
K --> L[First prefix: extract dtype from .idx]
L --> M[Create IndexedDatasetBuilder — open output .bin]
M --> N[For each prefix in sorted order]
N --> O[extract_index_contents — read .idx]
O --> P{dtype matches?}
P -- No --> Q([AssertionError — dtype mismatch])
P -- Yes --> R[Extend sequence_lengths and document_indices]
R --> S[shutil.copyfileobj — append .bin data]
S --> N
N --> T[builder.finalize — close .bin, write .idx via _IndexWriter]
T --> U([Done — output_prefix.bin and .idx written])
Reviews (8): Last reviewed commit: "Merge branch 'main' into merge_datasets" | Re-trigger Greptile |
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
sarahyurick
left a comment
There was a problem hiding this comment.
Thanks! I added some minor requests and left some comments on the greptile. Let me know what you think.
| @@ -0,0 +1,277 @@ | |||
| """ | |||
| Simplified version of the tools/merge_datasets.py script from the Megatron-LM library. | |||
Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
|
Hi @asolergi-nv can you add some pytests for this? |
sarahyurick
left a comment
There was a problem hiding this comment.
LGTM thank you @asolergi-nv !
|
/ok to test fe25bb4 |
Description
Addresses #1399. Adds
tutorials/text/megatron-tokenizer/merge_datasets.pyscript to merge multiple file prefixes generated byMegatronTokenizerWriterinto a single one . It is a simplified version oftools/merge_datasets.pyscript from the Megatron-LM library. Tested with the 2 different vocab sizes (≤65535, >65535).Usage
python tutorials/text/megatron-tokenizer/merge_datasets.py \ --input-dir /path/to/tokenized/files \ --output-prefix /path/to/output/mergedChecklist