[Fix] Enable dpsk r1 mxfp4 V2 model#934
Please measure the performance before and after this PR. How much of a performance benefit can we get?
    and self.source_quant_dtype is None
    and self.layer_quant_config.quant_method == "quark"
):
    self._mxfp4_unshuffled_weight = self.weight.detach().clone()
With the current model-loading patch, ATOM can also load the V2 model, right?
Yes, I have added support for ATOM as well.
This PR only adapts the basic model so that it runs correctly, so it does not bring much performance improvement. We will optimize the model step by step in follow-up PRs to reach good performance. Here is the fused shared_expert PR: #958
Motivation
Enable DeepSeek-R1-0528-MXFP4-V2 to run correctly with SGLang plugin mode on the non-Triton MXFP4 path.
This model stores attention kv_b_proj weights as static Quark MXFP4 (fp4x2, per_1x32). The existing DeepSeek V2 path treated non-Triton FP4 attention weights as unsupported/unquantized and later processed shuffled GEMM-layout weights as if they were still in checkpoint layout, which can corrupt MLA kc/vc weight reconstruction.
Technical Details
This PR adds a narrow static Quark MXFP4 path for DeepSeek V2 attention:
Preserve quant_config for non-Triton static Quark MXFP4 attention layers, while keeping the original behavior for other FP4 non-Triton cases.
Save unshuffled kv_b_proj weight and scale before LinearBase applies GEMM layout shuffling, so MLA post-load processing can dequantize using matching original weight/scale layout.
Update quark_post_load_weights() to handle torch.float4_e2m1fn_x2 static MXFP4 weights by decoding their packed uint8 view and using the preserved unshuffled scale when available.
Update SGLang MLA weight post-processing to recognize Quark MXFP4 via layer_quant_config, read preserved unshuffled kv_b_proj data only for that narrow case, and avoid applying the generic HIP/vLLM layout path to Quark MXFP4 weights.
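To make the decoding step above concrete, here is a minimal pure-Python sketch of per_1x32 MXFP4 dequantization. It is not the code in quark_post_load_weights(); the names decode_fp4 and dequant_mxfp4_row are hypothetical, the scales are assumed to be E8M0 exponent bytes (value 2^(e - 127)), and the nibble order (first element in the low 4 bits) is an assumption that should be checked against the actual checkpoint layout.

```python
# FP4 E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # per_1x32: one scale per 32 consecutive elements of a row


def decode_fp4(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequant_mxfp4_row(packed, scales_e8m0):
    """Dequantize one row of packed fp4x2 bytes.

    packed:      list of uint8 values, two FP4 elements per byte
    scales_e8m0: list of uint8 exponent bytes, one per 32 elements (assumed E8M0)
    """
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F))  # low nibble first (assumption)
        values.append(decode_fp4(byte >> 4))    # then the high nibble
    if len(values) != BLOCK * len(scales_e8m0):
        raise ValueError("expected exactly one scale per 32 unpacked elements")
    # Each element is multiplied by the power-of-two scale of its own block.
    return [v * 2.0 ** (scales_e8m0[i // BLOCK] - 127) for i, v in enumerate(values)]
```

For instance, sixteen bytes of 0x21 with a single scale byte of 128 decode to alternating 1.0 and 2.0. Because every element is multiplied by the scale of the block it sits in, the weight bytes and the scales have to be in the same layout, which is why the PR keeps the unshuffled kv_b_proj weight and scale around for MLA post-load processing.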
Test Plan
server:
curl:
Test Result
before:

after:

accuracy:
before:

after:

accuracy:

Submission Checklist