Feat/minimax m2.5 support #1929
Open
xs1997zju wants to merge 2 commits into
Add full integration for MiniMax-M2.5, a 229B MoE model with 256 experts and top-8 routing. This includes:
- Model spec plugin with custom SelfAttention for full-dimension QK Norm (RMSNorm over all heads concatenated, with TP gather/scatter)
- mbridge weight bridge (HF <-> Megatron conversion via Qwen2MoEBridge)
- Megatron-to-HF converter for saving trained checkpoints
- Shell scripts: model args, RL training launch, HF <-> Megatron weight conversion (3-script pipeline)

Key architecture differences from standard Qwen2MoE:
- block_sparse_moe prefix with w1/w2/w3 expert naming
- Full-dimension QK Norm (q_norm/k_norm, not per-head)
- Sigmoid router with e_score_correction_bias
- Partial RoPE (rotary_percent=0.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
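For reviewers unfamiliar with the QK Norm distinction, here is a minimal pure-Python sketch of per-head versus full-dimension RMSNorm. It is not the code from this PR: the function names are hypothetical, the learned norm weight is omitted, and tensor-parallel ranks are emulated with plain lists, so "gather" and "scatter" are just concatenation and slicing.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned weight: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def per_head_qk_norm(q, num_heads, eps=1e-6):
    """Per-head variant: each head is normalized with its own statistic."""
    head_dim = len(q) // num_heads
    out = []
    for h in range(num_heads):
        out.extend(rms_norm(q[h * head_dim:(h + 1) * head_dim], eps))
    return out


def full_dim_qk_norm(q, eps=1e-6):
    """Full-dimension variant: one statistic over all heads concatenated."""
    return rms_norm(q, eps)


def full_dim_qk_norm_tp(shards, eps=1e-6):
    """Emulated tensor parallelism: every rank holds a slice of the heads.

    The statistic must span all heads, so the slices are gathered,
    normalized together, and scattered back to their ranks.
    """
    gathered = [v for shard in shards for v in shard]  # stands in for a TP gather
    normed = rms_norm(gathered, eps)
    out, start = [], 0
    for shard in shards:  # stands in for a TP scatter
        out.append(normed[start:start + len(shard)])
        start += len(shard)
    return out


q = [1.0, 1.0, 3.0, 3.0]  # two heads with head_dim = 2
per_head = per_head_qk_norm(q, num_heads=2)
full = full_dim_qk_norm(q)
sharded = full_dim_qk_norm_tp([q[:2], q[2:]])
```

With this input the per-head variant maps both heads to roughly unit values, while the full-dimension variant preserves the relative magnitude between heads. The sharded version matches the unsharded full-dimension result, which is why the gather/scatter step is needed under TP.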
Summary
Add full integration for MiniMax-M2.5 (256 experts, top-8 routing), including:
- Model spec plugin (slime_plugins/models/minimax_m2.py): Custom SelfAttention with full-dimension QK Norm (RMSNorm over all heads concatenated, with TP gather/scatter)
- mbridge weight bridge (slime_plugins/mbridge/minimax_m2.py): HF ↔ Megatron weight mapping extending Qwen2MoEBridge
- Megatron-to-HF converter (slime/backends/megatron_utils/megatron_to_hf/minimax_m2.py): Reverse conversion for saving trained checkpoints back to HF format
- Shell scripts (scripts/): Model architecture args, RL training launch script, and 3-script HF ↔ Megatron weight conversion pipeline

Key architecture differences from standard Qwen2MoE
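- block_sparse_moe prefix with w1/w2/w3 expert naming (standard Qwen2MoE uses mlp)
- Full-dimension QK Norm (q_norm/k_norm, not per-head)
- Sigmoid router with e_score_correction_bias
- Partial RoPE (rotary_percent=0.5)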
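As a rough illustration of the router difference, the sketch below routes a single token with sigmoid scores and a correction bias. It assumes the bias only influences which experts are selected, while the mixing weights come from the unbiased sigmoid scores renormalized over the selected experts. The PR description does not spell this out, so treat that convention, along with the hypothetical names `route_token`, `correction_bias`, and `top_k`, as assumptions rather than the plugin's actual code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, correction_bias, top_k):
    """Pick top_k experts for one token.

    Selection uses sigmoid(logit) + bias; the returned weights use the
    unbiased sigmoid scores, renormalized to sum to 1 (assumed convention).
    """
    scores = [sigmoid(l) for l in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


# Toy example with 4 experts and top-2 routing (the model itself uses 256 experts, top-8).
logits = [2.0, 1.0, 0.0, -1.0]
no_bias, _ = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
chosen, weights = route_token(logits, [0.0, 0.0, 0.5, 0.0], top_k=2)
```

In the toy example the bias promotes expert 2 into the selected set, yet its mixing weight is still derived from its raw sigmoid score, so the bias shifts load without changing how strongly a selected expert contributes.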