[megatron] Support hybridep: add moe_flex_dispatcher_backend argument and hybridep padding fix #8393
tingmingzhong wants to merge 1 commit into modelscope:main
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical hang issue encountered when the MoE flex token dispatcher is used with the hybridep backend.
Code Review
This pull request introduces support for the hybridep backend for the MoE flex token dispatcher in Megatron. This is achieved by adding a new moe_flex_dispatcher_backend argument and modifying the padding logic in get_padding_to to prevent hangs, which is a solid improvement. The inclusion of a comprehensive new test suite for get_padding_to is particularly commendable, as it covers the new logic thoroughly and improves the overall robustness of the utility function. My review includes a couple of minor suggestions for the new test file to improve its long-term maintainability by ensuring the tested code is an exact mirror of the production code.
```python
if args.context_parallel_size > 1:
    padding_to = (padding_to or 1) * args.context_parallel_size
origin_padding_to = padding_to
fp8_format: Optional[str] = args.fp8_format or args.fp8
```
To keep the inlined function perfectly in sync with the source code in swift/megatron/utils/utils.py, it's better to use getattr for safety. This ensures the test accurately reflects the production code's behavior, improving maintainability.
```diff
-fp8_format: Optional[str] = args.fp8_format or args.fp8
+fp8_format = getattr(args, 'fp8_format', None) or getattr(args, 'fp8', None)
```
```python
# padding to max seq_length to avoid hybridep all-gather-into-tensor hang
moe_backend: Optional[str] = getattr(args, 'moe_flex_dispatcher_backend', None)
if moe_backend == 'hybridep':
    seq_length: Optional[int] = args.seq_length or args.max_length
```
To keep the inlined function perfectly in sync with the source code in swift/megatron/utils/utils.py, it's better to use getattr for safety. This ensures the test accurately reflects the production code's behavior, improving maintainability.
```diff
-seq_length: Optional[int] = args.seq_length or args.max_length
+seq_length = getattr(args, 'seq_length', None) or getattr(args, 'max_length', None)
```
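Both suggestions above rely on the same point: plain attribute access raises `AttributeError` when a field is absent from the args namespace, while three-argument `getattr` degrades gracefully. A minimal standalone illustration (not code from the swift repository; the `SimpleNamespace` stand-in for the real args object is an assumption):

```python
from types import SimpleNamespace

# Hypothetical args object that lacks the optional 'fp8' field,
# as an older or minimal config might.
args = SimpleNamespace(fp8_format=None)

# Direct attribute access fails when the field is absent:
try:
    fp8_format = args.fp8_format or args.fp8
except AttributeError as e:
    print('direct access failed:', e)

# getattr with a default falls back to None instead of raising:
fp8_format = getattr(args, 'fp8_format', None) or getattr(args, 'fp8', None)
print('getattr result:', fp8_format)  # → getattr result: None
```

This is why mirroring the production code's `getattr` form in the inlined test copy matters: the two versions only behave identically on namespaces that happen to define every field.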
Description:
What does this PR do?
This PR adds a new moe_flex_dispatcher_backend argument to MegatronArguments and fixes a hang issue when using the hybridep backend with the MoE flex token dispatcher.
Background
When using moe_token_dispatcher_type='flex' with the hybridep backend, the all-gather-into-tensor operation can hang due to inconsistent sequence lengths across ranks. This is resolved by padding all sequences to seq_length (or max_length) when moe_flex_dispatcher_backend='hybridep' is set.
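The fix described above can be sketched as follows. This is an illustrative reconstruction built from the snippets quoted in this review, not the actual `swift/megatron/utils/utils.py` implementation; the initial `padding_to` handling and field names on `args` are assumptions:

```python
from types import SimpleNamespace
from typing import Optional

def get_padding_to(args) -> Optional[int]:
    """Sketch of the padding resolution logic this PR extends."""
    padding_to: Optional[int] = getattr(args, 'padding_to', None)
    if getattr(args, 'context_parallel_size', 1) > 1:
        padding_to = (padding_to or 1) * args.context_parallel_size
    # New hybridep branch: pad every batch to the full sequence length so
    # all ranks present identically shaped tensors to all-gather-into-tensor,
    # which otherwise hangs on mismatched shapes.
    moe_backend = getattr(args, 'moe_flex_dispatcher_backend', None)
    if moe_backend == 'hybridep':
        seq_length = getattr(args, 'seq_length', None) or getattr(args, 'max_length', None)
        if seq_length is not None:
            padding_to = seq_length
    return padding_to

args = SimpleNamespace(padding_to=None, context_parallel_size=1,
                       moe_flex_dispatcher_backend='hybridep', seq_length=4096)
print(get_padding_to(args))  # → 4096
```

With any other backend value, the hybridep branch is skipped and padding behaves as before.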
Changes
swift/megatron/arguments/megatron_args.py: Added moe_flex_dispatcher_backend field with options 'deepep' and 'hybridep' (default None).
swift/megatron/utils/utils.py: Extended get_padding_to() to set padding_to = seq_length when moe_flex_dispatcher_backend == 'hybridep', preventing the all-gather-into-tensor hang.
tests/megatron/test_utils.py: Added 20 unit tests covering all get_padding_to() branches including the new hybridep logic.
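The new argument's shape can be pictured with a small self-contained sketch. The real `MegatronArguments` class has many more fields and may validate choices differently; the `MoeDispatcherArgs` dataclass below is a hypothetical stand-in showing only the field and the `'deepep'`/`'hybridep'`/`None` choices described above:

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class MoeDispatcherArgs:
    # None keeps the previous behavior; 'hybridep' triggers the new padding.
    moe_flex_dispatcher_backend: Optional[Literal['deepep', 'hybridep']] = None

    def __post_init__(self):
        allowed = (None, 'deepep', 'hybridep')
        if self.moe_flex_dispatcher_backend not in allowed:
            raise ValueError(
                f'moe_flex_dispatcher_backend must be one of {allowed}, '
                f'got {self.moe_flex_dispatcher_backend!r}')

print(MoeDispatcherArgs(moe_flex_dispatcher_backend='hybridep'))
```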
Impact
Only affects the Megatron training path when moe_flex_dispatcher_backend='hybridep' is explicitly set. No impact on other configurations.
Experiment results
When using ms-swift to train Qwen3-30B-A3B on a single-node 8-GPU B200 machine with TP=1, PP=1, and EP=8, HybridEP achieves nearly a 30% performance improvement.