Conversation
Summary of Changes

This pull request integrates the Muon optimizer into the Megatron-SWIFT framework, providing users with additional advanced optimization options for training large language models. It updates both the core argument-parsing logic and the user-facing documentation to reflect the new optimizer types and their configurable parameters, along with initial compatibility constraints.
Code Review
This pull request adds support for the Muon optimizer, including muon and dist_muon options. The changes include updating the command-line arguments in both Python code and documentation.
My review found a critical issue in the argument validation logic that would prevent `dist_muon` from being used with its default settings; a suggested fix is included below. Additionally, the English documentation for the `optimizer` parameter has a minor omission. Overall, the changes are good, but the validation logic needs to be corrected.
```python
assert not self.overlap_grad_reduce, 'Muon optimizer does not support overlap grad reduce for now.'
assert not self.overlap_param_gather, 'Muon optimizer does not support overlap param gather for now.'
assert not self.use_distributed_optimizer, 'Muon optimizer does not support distributed optimizer for now.'
```
The current validation logic for the Muon optimizer incorrectly blocks the use of `dist_muon`. The check `assert not self.use_distributed_optimizer` applies to any optimizer with "muon" in its name, including `dist_muon`. Since `use_distributed_optimizer` defaults to `True`, this makes `dist_muon` unusable by default, which seems to contradict its purpose.

The validation should differentiate between `muon` and `dist_muon`: the `muon` optimizer should not be used with the distributed optimizer, but `dist_muon` presumably should.
Here is a suggested correction to fix this logic and provide more informative error messages.
```diff
-assert not self.overlap_grad_reduce, 'Muon optimizer does not support overlap grad reduce for now.'
-assert not self.overlap_param_gather, 'Muon optimizer does not support overlap param gather for now.'
-assert not self.use_distributed_optimizer, 'Muon optimizer does not support distributed optimizer for now.'
+assert not self.overlap_grad_reduce, 'Muon optimizers do not support overlap grad reduce for now.'
+assert not self.overlap_param_gather, 'Muon optimizers do not support overlap param gather for now.'
+if self.optimizer == 'muon':
+    assert not self.use_distributed_optimizer, \
+        'The "muon" optimizer does not support the distributed optimizer. ' \
+        'Use "dist_muon" for distributed training or set "--use_distributed_optimizer false".'
```
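The corrected branching can be exercised in isolation. The sketch below uses a hypothetical `Args` dataclass as a minimal stand-in for the real argument class (it is not the ms-swift implementation), reproducing only the fields involved in the check:

```python
from dataclasses import dataclass


@dataclass
class Args:
    # Hypothetical, simplified stand-in for the argument class under review.
    optimizer: str
    use_distributed_optimizer: bool = True  # defaults to True, as noted in the review
    overlap_grad_reduce: bool = False
    overlap_param_gather: bool = False

    def validate(self):
        if 'muon' in self.optimizer:
            assert not self.overlap_grad_reduce, 'Muon optimizers do not support overlap grad reduce for now.'
            assert not self.overlap_param_gather, 'Muon optimizers do not support overlap param gather for now.'
            # Only plain "muon" is incompatible with the distributed optimizer.
            if self.optimizer == 'muon':
                assert not self.use_distributed_optimizer, (
                    'The "muon" optimizer does not support the distributed optimizer. '
                    'Use "dist_muon" for distributed training or set "--use_distributed_optimizer false".')


# dist_muon now validates with the default use_distributed_optimizer=True ...
Args(optimizer='dist_muon').validate()

# ... while plain muon still rejects the distributed optimizer.
try:
    Args(optimizer='muon').validate()
    muon_rejected = False
except AssertionError:
    muon_rejected = True
print('ok' if muon_rejected else 'bug')  # prints "ok"
```

With the original single `assert`, the first call would already fail, since `'muon' in 'dist_muon'` is true and `use_distributed_optimizer` defaults to `True`.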
```diff
 - Some models may not support flash attention; you need to manually set `--attention_backend unfused/fused --padding_free false`, for example: Llama4, GPT-OSS.
 - If `flash_attention_3` is installed, specifying `--attention_backend flash` will prioritize using FA3. Refer to the training script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3).
-- optimizer: Optimizer type, options are 'adam', 'sgd'. Default is adam.
+- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'.
```
The description for the `optimizer` parameter is missing the default value. For clarity and consistency with other parameter descriptions and the Chinese documentation, please specify that the default optimizer is `'adam'`.
```diff
-- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'.
+- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'. Default is adam.
```