[Fix] Enable dpsk r1 mxfp4 V2 model by qichu-yun · Pull Request #934 · ROCm/ATOM

qichu-yun · 2026-05-26T12:15:45Z

Motivation

Enable DeepSeek-R1-0528-MXFP4-V2 to run correctly with SGLang plugin mode on the non-Triton MXFP4 path.

This model stores attention kv_b_proj weights as static Quark MXFP4 (fp4x2, per_1x32). The existing DeepSeek V2 path treated non-Triton FP4 attention weights as unsupported/unquantized and later processed shuffled GEMM-layout weights as if they were still in checkpoint layout, which can corrupt MLA kc/vc weight reconstruction.
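The layout mismatch described above can be illustrated with a toy sketch. Everything in it is hypothetical: `toy_shuffle` is only a stand-in for a GEMM layout shuffle (it is not the shuffle applied by LinearBase), and the block size is 4 rather than the 32 used by per_1x32. It only shows that block scales stay correct when paired with the layout they were computed for.

```python
BLOCK = 4  # toy block size; the checkpoint in this PR uses per_1x32


def dequantize(values, scales, block=BLOCK):
    """Multiply each element by the scale of the block it sits in."""
    return [v * scales[i // block] for i, v in enumerate(values)]


def toy_shuffle(values):
    """Hypothetical layout shuffle: interleave the two halves of the row."""
    half = len(values) // 2
    out = []
    for a, b in zip(values[:half], values[half:]):
        out += [a, b]
    return out


# Quantized values and per-block scales in checkpoint layout.
checkpoint_values = [1, 1, 1, 1, 2, 2, 2, 2]
scales = [10.0, 100.0]
reference = dequantize(checkpoint_values, scales)

# Keep a copy before the shuffle, then shuffle the live weight.
preserved = list(checkpoint_values)
shuffled = toy_shuffle(checkpoint_values)

# Shuffled values paired with checkpoint-layout scales give wrong numbers.
corrupted = dequantize(shuffled, scales)
# The preserved copy still matches its scales.
restored = dequantize(preserved, scales)
```

In this toy case `corrupted` is not merely a reordering of `reference`: elements land in blocks with the wrong scale, so the values themselves change.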

Technical Details

This PR adds a narrow static Quark MXFP4 path for DeepSeek V2 attention:

Preserve quant_config for non-Triton static Quark MXFP4 attention layers, while keeping the original behavior for other FP4 non-Triton cases.
Save unshuffled kv_b_proj weight and scale before LinearBase applies GEMM layout shuffling, so MLA post-load processing can dequantize using matching original weight/scale layout.
Update quark_post_load_weights() to handle torch.float4_e2m1fn_x2 static MXFP4 weights by decoding their packed uint8 view and using the preserved unshuffled scale when available.
Update SGLang MLA weight post-processing to recognize Quark MXFP4 via layer_quant_config, read preserved unshuffled kv_b_proj data only for that narrow case, and avoid applying the generic HIP/vLLM layout path to Quark MXFP4 weights.
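As a rough illustration of the decoding step in the list above, the following pure-Python sketch dequantizes one row of packed fp4x2 data with per_1x32 scales. It is not the code in `quark_post_load_weights()`. The function names are hypothetical, and it assumes that the low nibble of each byte holds the even-indexed element and that scales are E8M0 exponents with a bias of 127 (the NaN encoding 255 is ignored).

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # per_1x32: one shared scale per 32 consecutive elements


def decode_fp4_code(code: int) -> float:
    """Decode a single 4-bit E2M1 code into a float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_fp4x2(packed: bytes) -> list:
    """Split each byte into two FP4 values (assumed order: low nibble first)."""
    out = []
    for byte in packed:
        out.append(decode_fp4_code(byte & 0xF))
        out.append(decode_fp4_code(byte >> 4))
    return out


def dequantize_row(packed: bytes, scales_e8m0: bytes) -> list:
    """Apply one power-of-two scale to every block of 32 decoded elements."""
    values = unpack_fp4x2(packed)
    if len(values) != BLOCK * len(scales_e8m0):
        raise ValueError("expected one scale per 32 unpacked elements")
    return [
        v * 2.0 ** (scales_e8m0[i // BLOCK] - 127)
        for i, v in enumerate(values)
    ]


# 16 bytes unpack to one block of 32 elements; scale byte 128 means 2**1.
row = dequantize_row(bytes([0x21]) * 16, bytes([128]))
```

The weight bytes and the scale bytes passed to such a routine must come from the same layout, which is why the PR keeps the unshuffled kv_b_proj weight and scale together.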

Test Plan

sglang + ATOM plugin
server:

export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1

model_path=/shared/data/amd_int/models/deepseek-ai/DeepSeek-R1-0528-MXFP4-v2

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models

TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path $model_path \
    --host localhost \
    --port 9000 \
    --trust-remote-code \
    --tp-size 8 \
    --ep-size 1 \
    --mem-fraction-static 0.9 \
    --disable-radix-cache

curl:

 curl -X POST "http://localhost:9000/v1/completions" \
     -H "Content-Type: application/json" \
     -d '{
         "prompt": "The capital of China", "temperature": 0, "top_p": 1, "top_k": -1, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123
 }'
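For scripted checks, the same request can be built with the Python standard library. This is a minimal sketch: `build_completion_request` and `send` are hypothetical helper names, and the default port assumes the server listens on 9000 as in the launch command above.

```python
import json
import urllib.request


def build_completion_request(host="localhost", port=9000):
    """Build the POST request for the completions endpoint used in the test plan."""
    payload = {
        "prompt": "The capital of China",
        "temperature": 0,
        "top_p": 1,
        "top_k": -1,
        "repetition_penalty": 1.0,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "stream": False,
        "ignore_eos": False,
        "n": 1,
        "seed": 123,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_completion_request()
```

Calling `send(request)` against a running server returns the parsed response, which can then be compared between the before and after builds.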

Test Result

before:

after:

accuracy:

#!/bin/bash
set -euo pipefail
 
addr=localhost
port=9000
url=http://${addr}:${port}/v1/completions
 
model_path="/mnt/models/DeepSeek-R1-0528-MXFP4-V2"
num_concurrent="${LM_EVAL_CONCURRENT:-64}"
max_gen_toks="${LM_EVAL_MAX_GEN_TOKS:-512}"

lm_eval --model local-completions \
    --model_args "{\"base_url\": \"${url}\", \"model\": \"${model_path}\", \"num_concurrent\": ${num_concurrent}, \"max_retries\": 10, \"max_gen_toks\": ${max_gen_toks}}" \
    --tasks gsm8k \
    --batch_size auto \
    --num_fewshot 5 \
    --trust_remote_code
# Optional: append --limit 300 to the command above to evaluate only a subset.

ATOM

export ATOM_ENABLE_DS_QKNORM_FUSION=0
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=0
rm -rf /root/.cache/
model_path=/shared/data/amd_int/models/deepseek-ai/DeepSeek-R1-0528-MXFP4-v2
python -m atom.entrypoints.openai_server \
    --model $model_path \
    --host localhost \
    --port 9000 \
    --tensor-parallel-size 8 \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.8 \
    --no-enable_prefix_caching

before:

after:

accuracy:

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

wuhuikx · 2026-05-27T01:15:41Z

Please measure the performance before and after this PR. How much of a performance benefit can we get?
BTW, we need to fuse shared_expert and the routed experts together. Please share a trace to confirm that they are fused.

ZhiweiYan-96 · 2026-05-27T02:13:16Z

+            and self.source_quant_dtype is None
+            and self.layer_quant_config.quant_method == "quark"
+        ):
+            self._mxfp4_unshuffled_weight = self.weight.detach().clone()


With the current model loading patch, ATOM can also load the V2 model, right?

Yes, I have added ATOM support as well.

qichu-yun · 2026-05-28T09:36:18Z

Please measure the performance before and after this PR. How much of a performance benefit can we get? BTW, we need to fuse shared_expert and the routed experts together. Please share a trace to confirm that they are fused.

This PR only adapts the base model so that it runs correctly, so it brings little performance improvement. We will optimize the model incrementally in follow-up work.

Here is the fused shared_expert PR: #958

qichu-yun requested review from ZLkanyo009, ZhiweiYan-96, wuhuikx and zhuyuhua-v May 26, 2026 12:20

ZhiweiYan-96 reviewed May 27, 2026

Comment thread atom/plugin/sglang/attention_backend/sgl_attention_mla.py

ZhiweiYan-96 reviewed May 27, 2026

qichu-yun closed this May 28, 2026

qichu-yun reopened this May 28, 2026

qichu-yun force-pushed the dpsk_v2_model branch 2 times, most recently from 8ffce09 to 84b3a73 Compare May 28, 2026 12:07

qichu-yun added 2 commits May 29, 2026 07:53

[Fix] Enable dpsk r1 mxfp4 V2 model

cd52ba3

[Benchmark] Change model to dpsk v2 model for sglang plugin

13aa539

qichu-yun force-pushed the dpsk_v2_model branch from 84b3a73 to 13aa539 Compare May 29, 2026 12:56

qichu-yun added 2 commits June 1, 2026 05:46

[Fix] Move MXFP4 kv_b_proj preservation into SGLang MLA

bc4628a

[Fix] Handle SGLang MXFP4 kv_b_proj postprocess order

2f98336

qichu-yun force-pushed the dpsk_v2_model branch from 4ffc222 to 2f98336 Compare June 1, 2026 06:42

qichu-yun added 4 commits June 1, 2026 14:47

Merge branch 'main' into dpsk_v2_model

5188732

Merge branch 'main' into dpsk_v2_model

6f0ddd3

Merge branch 'main' into dpsk_v2_model

9f927ec

Merge branch 'main' into dpsk_v2_model

0531fed

valarLip approved these changes Jun 2, 2026

valarLip merged commit 11fa32a into main Jun 2, 2026
24 of 31 checks passed

valarLip deleted the dpsk_v2_model branch June 2, 2026 07:31

qichu-yun restored the dpsk_v2_model branch June 2, 2026 12:45

valarLip merged 8 commits into main from dpsk_v2_model
