Checklist
Describe the bug
MXFP4 model loading fails: export_weight assertion doesn't include e2m1 weight type
Environment
- lmdeploy 0.12.2+cu128 (Windows, Python 3.12)
- Model quantized with llmcompressor 0.10.0.1 using QuantizationModifier(scheme="MXFP4A16")
Description
TurboMind cannot load MXFP4 models produced by llmcompressor. The converter sets weight_type='e2m1' globally for MXFP4 models, but export_weight() doesn't include 'e2m1' in its allowed types, causing an assertion failure when processing non-quantized layers (e.g. RMSNorm weights).
Steps to reproduce
- Quantize any model with llmcompressor MXFP4A16:
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
recipe = [QuantizationModifier(targets=["Linear"], scheme="MXFP4A16", ignore=["lm_head"])]
oneshot(model=model, dataset=ds, recipe=recipe, ...)
model.save_pretrained(output_dir)
- Patch config.json so lmdeploy routes to its mxfp4 path (llmcompressor saves quant_method: "compressed-tensors", but lmdeploy requires quant_method: "mxfp4").
- Load with TurboMind:
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(output_dir, backend_config=TurbomindEngineConfig(model_format="mxfp4"))
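For step 2, the config.json patch can be scripted. A minimal sketch (the quantization_config location and key names follow what llmcompressor writes in standard HF configs; patch_quant_method is a hypothetical helper, not part of either library):

```python
import json

def patch_quant_method(config_path):
    # Rewrite quant_method so lmdeploy's converter takes its mxfp4 path
    # instead of the compressed-tensors path, which rejects this model.
    with open(config_path) as f:
        config = json.load(f)
    quant_cfg = config.get("quantization_config", {})
    if quant_cfg.get("quant_method") == "compressed-tensors":
        quant_cfg["quant_method"] = "mxfp4"
    config["quantization_config"] = quant_cfg
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```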
Error
File "lmdeploy/turbomind/deploy/target_model/base.py", line 146, in export_weight
assert weight_type in ['float16', 'bfloat16', 'int4', 'fp8']
AssertionError
Stack trace shows it fails on layers.{i}.attention_norm.weight — a norm layer that should remain in float, not e2m1.
Root cause
In converter.py lines 86–88, model_format == 'mxfp4' sets weight_type = 'e2m1' globally for all layers. This flows into export_weight() (base.py:146) which only allows ['float16', 'bfloat16', 'int4', 'fp8']. The 'e2m1' type is missing from this list.
Additionally, there's no logic to differentiate between quantized Linear weights (which should be e2m1) and non-quantized norm/embedding weights (which should remain float). Compare with the GptOssForCausalLM special case at converter.py:91–93 which resets weight_type = dtype — suggesting this differentiation is needed but not generalized.
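One possible direction, sketched below with hypothetical names (this is not lmdeploy's actual API): resolve the weight type per tensor instead of globally, so norm/embedding weights keep the model dtype while quantized Linear weights get 'e2m1':

```python
# Hypothetical sketch of per-tensor weight-type resolution; the real fix
# would live in lmdeploy's converter/export path. The suffix list is an
# assumption standing in for "tensors the quantizer actually packed".
QUANTIZED_SUFFIXES = (".wq", ".wk", ".wv", ".wo", ".w1", ".w2", ".w3")

def resolve_weight_type(name, model_weight_type, dtype):
    """Return 'e2m1' only for tensors that were actually quantized;
    norm and embedding weights stay in the model's float dtype."""
    if model_weight_type == "e2m1":
        if name.endswith(QUANTIZED_SUFFIXES):
            return "e2m1"
        return dtype  # e.g. attention_norm.weight stays bfloat16
    return model_weight_type
```

This generalizes what the GptOssForCausalLM special case does by hand, instead of resetting the global weight_type for one architecture.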
Secondary issue
There is also a routing problem: llmcompressor saves MXFP4 models with quant_method: "compressed-tensors" and format: "mxfp4-pack-quantized" in the quantization config. lmdeploy's compressed-tensors path (converter.py:149) only accepts format == 'pack-quantized', rejecting mxfp4-pack-quantized. This means the model can't load through either path without manually patching config.json.
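The routing check could likewise be relaxed. A minimal sketch (format strings taken from the configs described above; the function name is hypothetical):

```python
# Hypothetical sketch: accept both compressed-tensors pack formats instead
# of the single hard-coded 'pack-quantized' string at converter.py:149.
ACCEPTED_FORMATS = {"pack-quantized", "mxfp4-pack-quantized"}

def is_supported_compressed_tensors(quant_cfg):
    """Return True if the quantization config should route to the
    compressed-tensors loading path."""
    return (quant_cfg.get("quant_method") == "compressed-tensors"
            and quant_cfg.get("format") in ACCEPTED_FORMATS)
```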
Reproduction
I ran a benchmarking script after successfully converting a model using llmcompressor.
Environment
Windows 10, RTX 4090, Python 3.12, torch 2.9.0, lmdeploy 0.12.2, compressed-tensors 0.14.0.1, llmcompressor 0.10.0.1, cuda 12.8.1, transformers 4.57.6, triton-windows 3.5.0.post21
Error traceback