Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800polarG wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp
--workersforpreprocess_data.pytomax(1, self._args.num_workers)to avoidmultiprocessing.Pool(0)failures. - Add inline clarification comment documenting why preprocessing clamps workers while other components may allow
num_workers=0.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Megatron's preprocess_data.py appends '_text_document' to the
--output-prefix when producing the .bin/.idx files. Derive the
output-prefix from --data_prefix (stripping a trailing
'_text_document' suffix when present) so that the generated files
match the existence checks for any custom data_prefix value, instead
of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
num_workers=0 clamps to '--workers 1' with default data_prefix
('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
with custom 'custom_text_document' -> '<data_home>/custom', and a
data_prefix without the suffix used as-is.
Avoid '--workers 1' matching '--workers 10' etc.
a475b85 to
ee2839e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #800 +/- ##
==========================================
+ Coverage 85.69% 85.82% +0.13%
==========================================
Files 103 103
Lines 7890 7900 +10
==========================================
+ Hits 6761 6780 +19
+ Misses 1129 1120 -9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…et generate test
- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value. - Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case. - Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix test pollution: clean up pretrain_gpt.py created by test_megatron_gpt_dataset_generate_command so it does not break the alphabetically-later test_megatron_gpt_preprocess (root cause of CI cpu-unit-test failures on python-3.7/3.10/3.12). - Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead of .args[0] (mock.call.args is Python 3.8+). - Enforce data_prefix contract in _generate_dataset(): validate that data_prefix ends with '_text_document' and has a non-empty stem before invoking preprocess_data.py. Fail fast with a clear error otherwise (preprocess_data.py always appends '_text_document' to --output-prefix, so other prefixes would silently produce files whose names mismatch the existence check). - Replace test Cases 3 and 4 (which masked the contract bug via a side_effect that lied about preprocess_data.py output) with negative cases that assert _preprocess() fails fast and never invokes run_command for invalid data_prefix values.
- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.