
Fix ckpt conversion for qwen3-moe models (transformers==5.8.0) #3868

Open
hengtaoguo wants to merge 1 commit into main from hengtaoguo-ckpt

Conversation

Collaborator

@hengtaoguo hengtaoguo commented May 11, 2026

Description

This PR updates the Qwen MoE parameter mapping and hook functions to support breaking configuration changes introduced in transformers==5.8.0 (breaking b/511249077).

1. Configuration Renaming (num_experts → num_local_experts)

In Qwen3MoeConfig, the configuration attribute tracking the number of MoE experts was renamed (commands to reproduce).

  • 4.57.0: config.num_experts
  • 5.8.0: config.num_local_experts
  • Impact: Caused MaxText's mapping functions to default to 0 experts, mistaking the MoE architecture for a dense model.
  • Solution: This PR checks for both names, num_experts first and then num_local_experts; this fallback pattern supports both old and new transformers versions (see the sketch after this list).
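
A minimal sketch of this fallback pattern, assuming only the two attribute names cited above; the helper name get_num_experts is illustrative, not the PR's actual code:

```python
def get_num_experts(config) -> int:
  """Read the MoE expert count across transformers versions."""
  # transformers<=4.57.0 exposes config.num_experts; 5.8.0 renamed it to
  # config.num_local_experts. Checking both covers old and new versions.
  num_experts = getattr(config, "num_experts", None)
  if num_experts is None:
    num_experts = getattr(config, "num_local_experts", 0)
  return num_experts
```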

2. Fused Expert Layers (Individual → Stacked/Fused)

In Qwen3MoeForCausalLM, the PyTorch model structure was refactored from individual expert modules to single stacked tensors (commands to reproduce).

  • 4.57.0: Separate expert sub-modules with individual projections (experts.0.gate_proj.weight, experts.0.up_proj.weight, etc.).
  • 5.8.0: All experts are stacked together into a single layer, and the gate/up projections are fused into single tensors (experts.gate_up_proj and experts.down_proj).
  • Impact: Caused missing key errors when MaxText attempted to extract weights via state_dict().
  • Solution: Bypassed by setting --eager_load_method=safetensors, which reads weights directly from the safetensors files instead of going through state_dict() (a sketch of handling the fused layout follows this list).
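
For illustration, a hedged sketch of splitting the 5.8.0-style fused expert tensor back into per-projection weights. The shape convention and the assumption that the gate half precedes the up half along the last axis are not verified against the PR and should be checked against an actual checkpoint:

```python
import numpy as np

def split_gate_up(gate_up_proj: np.ndarray, moe_intermediate_size: int):
  """Split a fused experts.gate_up_proj tensor into gate and up projections.

  Assumed shape: (num_experts, hidden_size, 2 * moe_intermediate_size),
  gate half first.
  """
  gate_proj = gate_up_proj[..., :moe_intermediate_size]  # first half
  up_proj = gate_up_proj[..., moe_intermediate_size:]    # second half
  return gate_proj, up_proj
```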

Previous Qwen3-MoE to_huggingface runs may have been flawed, since _validate_or_update_architecture was checking misaligned hyperparameters. This PR updates the logic to compare intermediate_size against base_mlp_dim * num_experts_per_tok for Qwen3-MoE (see the sketch below).
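
A minimal sketch of the corrected comparison, assuming base_mlp_dim and num_experts_per_tok are MaxText config fields as named above; the function name and error message are illustrative:

```python
def check_qwen3_moe_mlp_dims(hf_config, mt_config) -> None:
  """Compare HF intermediate_size to MaxText's per-token MLP width."""
  expected = mt_config.base_mlp_dim * mt_config.num_experts_per_tok
  if hf_config.intermediate_size != expected:
    raise ValueError(
        f"intermediate_size={hf_config.intermediate_size} does not match "
        f"base_mlp_dim * num_experts_per_tok = {expected}"
    )
```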

Tests

```shell
python -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
  model_name=qwen3-30b-a3b \
  base_output_directory=gs://hengtaoguo-maxtext-logs/checkpoints/qwen3-30b-a3b/Qwen3-30B-A3B-Base/scanned/2026-05-11 \
  scan_layers=true \
  hf_access_token=<your_token> \
  weight_dtype=bfloat16 \
  hardware=cpu \
  skip_jax_distributed_system=True \
  checkpoint_storage_use_ocdbt=False \
  checkpoint_storage_use_zarr3=False \
  --eager_load_method=safetensors
```

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rc/maxtext/checkpoint_conversion/to_huggingface.py | 0.00% | 2 Missing ⚠️ |
| ...xtext/checkpoint_conversion/utils/param_mapping.py | 0.00% | 2 Missing ⚠️ |
| ...rc/maxtext/checkpoint_conversion/utils/hf_shape.py | 0.00% | 1 Missing ⚠️ |


@hengtaoguo hengtaoguo changed the title Fix ckpt conversion for qwen3-moe models Fix ckpt conversion for qwen3-moe models (transformers==5.7.0) May 11, 2026
@hengtaoguo hengtaoguo marked this pull request as ready for review May 11, 2026 18:38
@hengtaoguo hengtaoguo changed the title Fix ckpt conversion for qwen3-moe models (transformers==5.7.0) Fix ckpt conversion for qwen3-moe models (transformers==5.8.0) May 11, 2026
Collaborator

@shuningjin shuningjin left a comment


Thank you for the investigation and fix!

@shuningjin
Collaborator

As a followup, we might consolidate eager_load_method. It might be better to update the default to --eager_load_method=safetensors for the following reasons:

  • resilient to transformers name changes across versions, as reported here
  • consistency: lazy load already uses safe_open
  • allows models without transformers code support (e.g., ds3.2) and weights skipped by transformers (e.g., mtp), as mentioned in PR#3184

It previously defaulted to --eager_load_method=transformers for backward compatibility:

  • For most models, both load methods should give the same model structure (e.g., deepseek2-16b)
  • gemma3-4b is an exception (its safetensors keys do not have a "model" prefix); the current gemma3 mapping uses pretrained-based mapping
  • perhaps in the long run we can migrate these to safetensors-based mapping as well
