
[megatron] support muon #8392

Open
Jintao-Huang wants to merge 3 commits into modelscope:main from Jintao-Huang:support_muon_123

Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Muon optimizer into the Megatron-SWIFT framework, providing users with additional advanced optimization options for training large language models. It updates both the core argument parsing logic and the user-facing documentation to reflect the new optimizer types and their configurable parameters, along with initial compatibility constraints.

Highlights

  • New Optimizer Support: Added 'muon' and 'dist_muon' as new optimizer types, expanding the available optimization strategies for training.
  • Muon Optimizer Parameters: Introduced a comprehensive set of command-line parameters for configuring the Muon optimizer, including momentum, QKV splitting, Nesterov usage, scale mode, FP32 matrix multiplication precision, Newton-Schulz steps, tensor parallel mode, and an extra scale factor.
  • Compatibility Checks: Implemented validation checks to ensure that the Muon optimizer is not used with overlap_grad_reduce, overlap_param_gather, or use_distributed_optimizer in its current state.
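Since the highlights mention momentum, Nesterov usage, and Newton-Schulz steps, it may help reviewers to recall what the Muon update actually does. Below is a minimal NumPy sketch of the quintic Newton-Schulz orthogonalization and a single Muon-style update step. This is an illustration based on the publicly known reference formulation of Muon, not the code in this PR; the function names, coefficients, and the shape-based scale factor are assumptions for exposition.

```python
import numpy as np

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient matrix via the quintic
    Newton-Schulz iteration commonly used by Muon. Coefficients follow the
    public reference formulation; they are illustrative here."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)  # normalize by Frobenius norm
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # iterate on the short side for a smaller Gram matrix
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x

def muon_step(param, grad, buf, lr=0.02, momentum=0.95, nesterov=True):
    """One illustrative Muon update for a 2-D weight matrix."""
    buf = momentum * buf + grad
    update = grad + momentum * buf if nesterov else buf
    update = newton_schulz(update)
    # scale so the update magnitude is roughly shape-independent
    update *= max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    return param - lr * update, buf
```

The orthogonalized update is why Muon only applies to matrix-shaped parameters, and why options like QKV splitting and a tensor-parallel mode (listed above) matter: they determine which 2-D blocks the Newton-Schulz iteration sees.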



@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the Muon optimizer, including muon and dist_muon options. The changes include updating the command-line arguments in both Python code and documentation.

My review found a critical issue in the argument validation logic that would prevent dist_muon from being used with its default settings. I've provided a suggestion to fix this. Additionally, I've pointed out a minor omission in the English documentation for the optimizer parameter. Overall, the changes are good but the validation logic needs to be corrected.

Comment on lines +728 to +731

```python
assert not self.overlap_grad_reduce, 'Muon optimizer does not support overlap grad reduce for now.'
assert not self.overlap_param_gather, 'Muon optimizer does not support overlap param gather for now.'

assert not self.use_distributed_optimizer, 'Muon optimizer does not support distributed optimizer for now.'
```


critical

The current validation logic for the Muon optimizer incorrectly blocks the use of dist_muon. The check assert not self.use_distributed_optimizer applies to any optimizer with "muon" in its name, including dist_muon. Since use_distributed_optimizer defaults to True, this makes dist_muon unusable by default, which seems to contradict its purpose.

The validation should differentiate between muon and dist_muon. The muon optimizer should not be used with the distributed optimizer, but dist_muon presumably should.

Here is a suggested correction to fix this logic and provide more informative error messages.

Suggested change

```diff
-assert not self.overlap_grad_reduce, 'Muon optimizer does not support overlap grad reduce for now.'
-assert not self.overlap_param_gather, 'Muon optimizer does not support overlap param gather for now.'
-assert not self.use_distributed_optimizer, 'Muon optimizer does not support distributed optimizer for now.'
+assert not self.overlap_grad_reduce, 'Muon optimizers do not support overlap grad reduce for now.'
+assert not self.overlap_param_gather, 'Muon optimizers do not support overlap param gather for now.'
+if self.optimizer == 'muon':
+    assert not self.use_distributed_optimizer, \
+        'The "muon" optimizer does not support the distributed optimizer. ' \
+        'Use "dist_muon" for distributed training or set "--use_distributed_optimizer false".'
```
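As a quick sanity check of this branching, here is a minimal self-contained sketch. The `validate_muon_args` helper and the `SimpleNamespace` stand-in for the argument object are illustrative, not part of the PR; the point is that `dist_muon` with the default `use_distributed_optimizer=True` passes, while plain `muon` does not.

```python
from types import SimpleNamespace

def validate_muon_args(args):
    # Illustrative stand-in for the corrected validation branch.
    if 'muon' in args.optimizer:
        assert not args.overlap_grad_reduce, 'Muon optimizers do not support overlap grad reduce for now.'
        assert not args.overlap_param_gather, 'Muon optimizers do not support overlap param gather for now.'
        if args.optimizer == 'muon':
            assert not args.use_distributed_optimizer, (
                'The "muon" optimizer does not support the distributed optimizer. '
                'Use "dist_muon" for distributed training or set "--use_distributed_optimizer false".')

# dist_muon with the distributed optimizer enabled now validates cleanly:
validate_muon_args(SimpleNamespace(
    optimizer='dist_muon', overlap_grad_reduce=False,
    overlap_param_gather=False, use_distributed_optimizer=True))
```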

```diff
 - Some models may not support flash attention; you need to manually set `--attention_backend unfused/fused --padding_free false`, for example: Llama4, GPT-OSS.
 - If `flash_attention_3` is installed, specifying `--attention_backend flash` will prioritize using FA3. Refer to the training script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3).
-- optimizer: Optimizer type, options are 'adam', 'sgd'. Default is adam.
+- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'.
```


medium

The description for the optimizer parameter is missing the default value. For clarity and consistency with other parameter descriptions and the Chinese documentation, please specify that the default optimizer is 'adam'.

Suggested change

```diff
-- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'.
+- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'. Default is adam.
```
