Conversation
Summary of Changes

This pull request integrates the Muon optimizer into the Megatron-SWIFT framework, providing users with additional advanced optimization options for training large language models. It updates both the core argument-parsing logic and the user-facing documentation to reflect the new optimizer types and their configurable parameters, along with initial compatibility constraints.
Code Review
This pull request adds support for the Muon optimizer, including muon and dist_muon options. The changes include updating the command-line arguments in both Python code and documentation.
My review found a critical issue in the argument validation logic that would prevent `dist_muon` from being used with its default settings; a suggested fix is included below. Additionally, the English documentation for the `optimizer` parameter has a minor omission. Overall, the changes are good, but the validation logic needs to be corrected.
```python
assert not self.overlap_grad_reduce, 'Muon optimizer does not support overlap grad reduce for now.'
assert not self.overlap_param_gather, 'Muon optimizer does not support overlap param gather for now.'
assert not self.use_distributed_optimizer, 'Muon optimizer does not support distributed optimizer for now.'
```
The current validation logic for the Muon optimizer incorrectly blocks the use of `dist_muon`. The check `assert not self.use_distributed_optimizer` applies to any optimizer with "muon" in its name, including `dist_muon`. Since `use_distributed_optimizer` defaults to `True`, this makes `dist_muon` unusable by default, which seems to contradict its purpose.

The validation should differentiate between `muon` and `dist_muon`: the `muon` optimizer should not be used with the distributed optimizer, but `dist_muon` presumably should.
Here is a suggested correction to fix this logic and provide more informative error messages.
```diff
-assert not self.overlap_grad_reduce, 'Muon optimizer does not support overlap grad reduce for now.'
-assert not self.overlap_param_gather, 'Muon optimizer does not support overlap param gather for now.'
-assert not self.use_distributed_optimizer, 'Muon optimizer does not support distributed optimizer for now.'
+assert not self.overlap_grad_reduce, 'Muon optimizers do not support overlap grad reduce for now.'
+assert not self.overlap_param_gather, 'Muon optimizers do not support overlap param gather for now.'
+if self.optimizer == 'muon':
+    assert not self.use_distributed_optimizer, \
+        'The "muon" optimizer does not support the distributed optimizer. ' \
+        'Use "dist_muon" for distributed training or set "--use_distributed_optimizer false".'
```
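The corrected branching can be exercised in isolation. The sketch below uses a hypothetical `Args` dataclass as a minimal stand-in for the real argument class (it is not the ms-swift implementation), reproducing only the fields involved in the check:

```python
from dataclasses import dataclass


@dataclass
class Args:
    # Hypothetical, simplified stand-in for the argument class under review.
    optimizer: str
    use_distributed_optimizer: bool = True  # defaults to True, as noted in the review
    overlap_grad_reduce: bool = False
    overlap_param_gather: bool = False

    def validate(self):
        if 'muon' in self.optimizer:
            assert not self.overlap_grad_reduce, 'Muon optimizers do not support overlap grad reduce for now.'
            assert not self.overlap_param_gather, 'Muon optimizers do not support overlap param gather for now.'
            # Only plain "muon" is incompatible with the distributed optimizer.
            if self.optimizer == 'muon':
                assert not self.use_distributed_optimizer, (
                    'The "muon" optimizer does not support the distributed optimizer. '
                    'Use "dist_muon" for distributed training or set "--use_distributed_optimizer false".')


# dist_muon now validates with the default use_distributed_optimizer=True ...
Args(optimizer='dist_muon').validate()

# ... while plain muon still rejects the distributed optimizer.
try:
    Args(optimizer='muon').validate()
    muon_rejected = False
except AssertionError:
    muon_rejected = True
print('ok' if muon_rejected else 'bug')  # prints "ok"
```

With the original single `assert`, the first call would already fail, since `'muon' in 'dist_muon'` is true and `use_distributed_optimizer` defaults to `True`.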
```diff
 - Some models may not support flash attention; you need to manually set `--attention_backend unfused/fused --padding_free false`, for example: Llama4, GPT-OSS.
 - If `flash_attention_3` is installed, specifying `--attention_backend flash` will prioritize using FA3. Refer to the training script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3).
-- optimizer: Optimizer type, options are 'adam', 'sgd'. Default is adam.
+- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'.
```
The description for the `optimizer` parameter is missing the default value. For clarity and consistency with other parameter descriptions and the Chinese documentation, please specify that the default optimizer is `'adam'`.
```diff
-- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'.
+- optimizer: Optimizer type. Options include 'adam', 'sgd', 'muon', and 'dist_muon'. Default is adam.
```