Skip to content

Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800

Open
polarG wants to merge 8 commits into
mainfrom
dev/hongtaozhang/fix-num_workers-usage
Open

Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
polarG wants to merge 8 commits into
mainfrom
dev/hongtaozhang/fix-num_workers-usage

Conversation

@polarG
Copy link
Copy Markdown
Contributor

@polarG polarG commented Apr 1, 2026

Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.

This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.

@polarG polarG requested a review from a team as a code owner April 1, 2026 06:07
Copilot AI review requested due to automatic review settings April 1, 2026 06:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).

Changes:

  • Clamp --workers for preprocess_data.py to max(1, self._args.num_workers) to avoid multiprocessing.Pool(0) failures.
  • Add inline clarification comment documenting why preprocessing clamps workers while other components may allow num_workers=0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py
Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
@polarG polarG self-assigned this Apr 17, 2026
@polarG polarG added enhancement New feature or request ROCm labels Apr 27, 2026
- Megatron's preprocess_data.py appends '_text_document' to the
  --output-prefix when producing the .bin/.idx files. Derive the
  output-prefix from --data_prefix (stripping a trailing
  '_text_document' suffix when present) so that the generated files
  match the existence checks for any custom data_prefix value, instead
  of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
  num_workers=0 clamps to '--workers 1' with default data_prefix
  ('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
  with custom 'custom_text_document' -> '<data_home>/custom', and a
  data_prefix without the suffix used as-is.
Copilot AI review requested due to automatic review settings April 27, 2026 22:02
Avoid '--workers 1' matching '--workers 10' etc.
@polarG polarG force-pushed the dev/hongtaozhang/fix-num_workers-usage branch from a475b85 to ee2839e Compare April 27, 2026 22:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.82%. Comparing base (700d650) to head (d580e6e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   85.69%   85.82%   +0.13%     
==========================================
  Files         103      103              
  Lines        7890     7900      +10     
==========================================
+ Hits         6761     6780      +19     
+ Misses       1129     1120       -9     
Flag Coverage Δ
cpu-python3.10-unit-test 70.57% <100.00%> (+0.15%) ⬆️
cpu-python3.12-unit-test 70.57% <100.00%> (+0.15%) ⬆️
cpu-python3.7-unit-test 70.00% <100.00%> (+0.15%) ⬆️
cuda-unit-test 83.73% <100.00%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Hongtao Zhang added 2 commits May 2, 2026 00:19
…et generate test

- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
- Warn when num_workers is silently clamped from 0 to 1 for the
  preprocess subprocess so the user sees the override in the log.
  The DataLoader still receives the original num_workers value.
- Guard the '_text_document' suffix-strip against the edge case where
  data_prefix == '_text_document' (which would otherwise produce a
  malformed --output-prefix '<data_home>/' with an empty basename).
  Fall back to the original data_prefix value in that case.
- Extend test_megatron_gpt_dataset_generate_command with a 4th case
  asserting the empty-basename fallback.
Copilot AI review requested due to automatic review settings May 3, 2026 05:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
polarG and others added 2 commits May 15, 2026 13:18
- Fix test pollution: clean up pretrain_gpt.py created by
  test_megatron_gpt_dataset_generate_command so it does not break the
  alphabetically-later test_megatron_gpt_preprocess (root cause of CI
  cpu-unit-test failures on python-3.7/3.10/3.12).
- Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead
  of .args[0] (mock.call.args is Python 3.8+).
- Enforce data_prefix contract in _generate_dataset(): validate that
  data_prefix ends with '_text_document' and has a non-empty stem
  before invoking preprocess_data.py. Fail fast with a clear error
  otherwise (preprocess_data.py always appends '_text_document' to
  --output-prefix, so other prefixes would silently produce files
  whose names mismatch the existence check).
- Replace test Cases 3 and 4 (which masked the contract bug via a
  side_effect that lied about preprocess_data.py output) with
  negative cases that assert _preprocess() fails fast and never
  invokes run_command for invalid data_prefix values.
Copilot AI review requested due to automatic review settings May 15, 2026 20:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

- Collapse multi-line .format() call in _generate_dataset error message.
- Add blank line before 'return _side_effect' in nested function in test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants