Fix ckpt conversion for qwen3-moe models (transformers==5.8.0) #3868
Open
hengtaoguo wants to merge 1 commit into
Conversation
shuningjin (Collaborator) approved these changes on May 11, 2026 and left a comment:

Thank you for the investigation and fix!
NicoGrande (Collaborator) approved these changes on May 11, 2026
As a follow-up, we might consolidate `eager_load_method`; it might be better to update its default, which was previously set to a different value.
Description
This PR updates the Qwen MoE parameter mapping and hook functions to support breaking configuration changes introduced in `transformers==5.8.0` (breaking b/511249077).

1. Configuration Renaming (`num_experts` → `num_local_experts`)

In `Qwen3MoeConfig`, the configuration attribute tracking the number of MoE experts was renamed (commands to reproduce):

- 4.57.0: `config.num_experts`
- 5.8.0: `config.num_local_experts`

Without this fix, the conversion reads 0 experts, mistaking the MoE architecture for a dense model. The fix reads `num_experts` first, then falls back to `num_local_experts`; this fallback pattern supports both old and new transformers versions.
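A minimal sketch of the fallback described above, assuming access to the HF config object; the helper name is illustrative, not the actual function in this PR:

```python
def get_num_experts(hf_config) -> int:
    """Read the MoE expert count across transformers versions.

    transformers<=4.57.0 exposes `num_experts`; newer releases rename it to
    `num_local_experts`. Falling back avoids silently reading 0 experts and
    treating the MoE model as dense.
    """
    num_experts = getattr(hf_config, "num_experts", None)
    if num_experts is None:
        num_experts = getattr(hf_config, "num_local_experts", 0)
    return num_experts
```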
2. Fused Expert Layers (Individual → Stacked/Fused)

In `Qwen3MoeForCausalLM`, the PyTorch model structure was refactored from individual expert modules to single stacked tensors (commands to reproduce):

- 4.57.0: Separate expert sub-modules with individual projections (`experts.0.gate_proj.weight`, `experts.0.up_proj.weight`, etc.).
- 5.8.0: All experts are stacked together into a single layer, and the gate/up projections are fused into single tensors (`experts.gate_up_proj` and `experts.down_proj`).

This changes the keys the conversion sees relative to the module's `state_dict()`, so Qwen3-MoE conversion uses `--eager_load_method=safetensors`.
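A sketch of unpacking the fused weights back into per-expert projections. The tensor layouts below are assumptions for illustration, not the exact shapes used by the PR's hook functions:

```python
import numpy as np

def split_fused_experts(gate_up_proj: np.ndarray, down_proj: np.ndarray) -> dict:
    """Unpack stacked/fused expert weights into per-expert gate/up/down tensors.

    Assumed (hypothetical) layouts:
      gate_up_proj: [num_experts, hidden_size, 2 * moe_intermediate_size]
      down_proj:    [num_experts, moe_intermediate_size, hidden_size]
    Depending on the weight convention, a transpose may be needed to match
    the old per-expert nn.Linear weight shapes.
    """
    num_experts = gate_up_proj.shape[0]
    # Gate and up projections are concatenated along the last axis of the fused tensor.
    gate_proj, up_proj = np.split(gate_up_proj, 2, axis=-1)
    per_expert = {}
    for e in range(num_experts):
        per_expert[f"experts.{e}.gate_proj.weight"] = gate_proj[e]
        per_expert[f"experts.{e}.up_proj.weight"] = up_proj[e]
        per_expert[f"experts.{e}.down_proj.weight"] = down_proj[e]
    return per_expert
```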
3. Validation Fix in `to_huggingface`

Previous Qwen3-MoE `to_huggingface` runs may have been flawed, since `_validate_or_update_architecture` was checking mis-aligned hyperparameters. This PR updates the logic to compare `intermediate_size` with `base_mlp_dim * num_experts_per_tok` for Qwen3-MoE.
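A minimal sketch of the check described above; the function and parameter names are illustrative and not taken from the PR:

```python
def check_qwen3_moe_mlp_dims(intermediate_size: int,
                             base_mlp_dim: int,
                             num_experts_per_tok: int) -> None:
    """Validate that the HF MLP width matches the MaxText-side configuration.

    Compares the HF `intermediate_size` against `base_mlp_dim * num_experts_per_tok`,
    as described in the PR, and raises if they disagree.
    """
    expected = base_mlp_dim * num_experts_per_tok
    if intermediate_size != expected:
        raise ValueError(
            f"intermediate_size mismatch: HF has {intermediate_size}, "
            f"expected base_mlp_dim * num_experts_per_tok = {expected}"
        )
```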
Tests

Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.