
Fix ckpt conversion for qwen3-moe models (transformers==5.8.0) #3868

Open
hengtaoguo wants to merge 1 commit into main from hengtaoguo-ckpt

Conversation

Collaborator

@hengtaoguo hengtaoguo commented May 11, 2026

Description

This PR updates the Qwen MoE parameter mapping and hook functions to support breaking configuration changes introduced in transformers==5.8.0 (breaking b/511249077).

1. Configuration Renaming (num_experts → num_local_experts)

In Qwen3MoeConfig, the configuration attribute tracking the number of MoE experts was renamed (commands to reproduce).

  • 4.57.0: config.num_experts
  • 5.8.0: config.num_local_experts
  • Impact: Caused MaxText's mapping functions to default to 0 experts, mistaking the MoE architecture for a dense model.
  • Solution: This PR checks for both names, num_experts first and then num_local_experts; this fallback pattern supports both old and new transformers versions (see the sketch after this list).
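
A minimal sketch of this fallback pattern, assuming only the two attribute names cited above; the helper name get_num_experts is illustrative, not the PR's actual code:

```python
def get_num_experts(config) -> int:
  """Read the MoE expert count across transformers versions."""
  # transformers<=4.57.0 exposes config.num_experts; 5.8.0 renamed it to
  # config.num_local_experts. Checking both covers old and new versions.
  num_experts = getattr(config, "num_experts", None)
  if num_experts is None:
    num_experts = getattr(config, "num_local_experts", 0)
  return num_experts
```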

2. Fused Expert Layers (Individual → Stacked/Fused)

In Qwen3MoeForCausalLM, the PyTorch model structure was refactored from individual expert modules to single stacked tensors (commands to reproduce).

  • 4.57.0: Separate expert sub-modules with individual projections (experts.0.gate_proj.weight, experts.0.up_proj.weight, etc.).
  • 5.8.0: All experts are stacked together into a single layer, and the gate/up projections are fused into single tensors (experts.gate_up_proj and experts.down_proj).
  • Impact: Caused missing key errors when MaxText attempted to extract weights via state_dict().
  • Solution: Bypassed by setting --eager_load_method=safetensors, which reads weights directly from the safetensors files instead of going through state_dict() (a sketch of handling the fused layout follows this list).
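
For illustration, a hedged sketch of splitting the 5.8.0-style fused expert tensor back into per-projection weights. The shape convention and the assumption that the gate half precedes the up half along the last axis are not verified against the PR and should be checked against an actual checkpoint:

```python
import numpy as np

def split_gate_up(gate_up_proj: np.ndarray, moe_intermediate_size: int):
  """Split a fused experts.gate_up_proj tensor into gate and up projections.

  Assumed shape: (num_experts, hidden_size, 2 * moe_intermediate_size),
  gate half first.
  """
  gate_proj = gate_up_proj[..., :moe_intermediate_size]  # first half
  up_proj = gate_up_proj[..., moe_intermediate_size:]    # second half
  return gate_proj, up_proj
```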

Previous Qwen3-MoE to_huggingface runs may have been flawed, since _validate_or_update_architecture was checking misaligned hyperparameters. This PR updates the logic to compare intermediate_size against base_mlp_dim * num_experts_per_tok for Qwen3-MoE (see the sketch below).
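
A minimal sketch of the corrected comparison, assuming base_mlp_dim and num_experts_per_tok are MaxText config fields as named above; the function name and error message are illustrative:

```python
def check_qwen3_moe_mlp_dims(hf_config, mt_config) -> None:
  """Compare HF intermediate_size to MaxText's per-token MLP width."""
  expected = mt_config.base_mlp_dim * mt_config.num_experts_per_tok
  if hf_config.intermediate_size != expected:
    raise ValueError(
        f"intermediate_size={hf_config.intermediate_size} does not match "
        f"base_mlp_dim * num_experts_per_tok = {expected}"
    )
```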

Tests

```shell
python -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
  model_name=qwen3-30b-a3b \
  base_output_directory=gs://hengtaoguo-maxtext-logs/checkpoints/qwen3-30b-a3b/Qwen3-30B-A3B-Base/scanned/2026-05-11 \
  scan_layers=true \
  hf_access_token=<your_token> \
  weight_dtype=bfloat16 \
  hardware=cpu \
  skip_jax_distributed_system=True \
  checkpoint_storage_use_ocdbt=False \
  checkpoint_storage_use_zarr3=False \
  --eager_load_method=safetensors
```

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rc/maxtext/checkpoint_conversion/to_huggingface.py | 0.00% | 2 Missing ⚠️ |
| ...xtext/checkpoint_conversion/utils/param_mapping.py | 0.00% | 2 Missing ⚠️ |
| ...rc/maxtext/checkpoint_conversion/utils/hf_shape.py | 0.00% | 1 Missing ⚠️ |


@hengtaoguo hengtaoguo changed the title Fix ckpt conversion for qwen3-moe models Fix ckpt conversion for qwen3-moe models (transformers==5.7.0) May 11, 2026
@hengtaoguo hengtaoguo marked this pull request as ready for review May 11, 2026 18:38
@hengtaoguo hengtaoguo changed the title Fix ckpt conversion for qwen3-moe models (transformers==5.7.0) Fix ckpt conversion for qwen3-moe models (transformers==5.8.0) May 11, 2026
Collaborator

@shuningjin shuningjin left a comment


Thank you for the investigation and fix!

@shuningjin
Collaborator

As a followup, we might consolidate eager_load_method. It might be better to update the default to --eager_load_method=safetensors for the following reasons:

  • resilient to transformers name changes across versions, as reported here
  • consistency: lazy load already uses safe_open
  • allows models without transformers code support (e.g., ds3.2) and weights skipped by transformers (e.g., mtp), as mentioned in PR#3184

It previously defaulted to --eager_load_method=transformers for backward compatibility:

  • For most models, both load methods should give the same model structure (e.g., deepseek2-16b)
  • gemma3-4b is an exception (its safetensors keys do not have a "model" prefix); the current gemma3 mapping uses pretrained-based mapping
  • perhaps in the long run we can migrate these to safetensors-based mapping as well
