[Fix] Enable dpsk r1 mxfp4 V2 model #934

Merged. valarLip merged 8 commits into main from dpsk_v2_model on Jun 2, 2026.
@qichu-yun (Contributor) commented May 26, 2026

Motivation

Enable DeepSeek-R1-0528-MXFP4-V2 to run correctly with SGLang plugin mode on the non-Triton MXFP4 path.

This model stores attention kv_b_proj weights as static Quark MXFP4 (fp4x2, per_1x32). The existing DeepSeek V2 path treated non-Triton FP4 attention weights as unsupported/unquantized and later processed shuffled GEMM-layout weights as if they were still in checkpoint layout, which can corrupt MLA kc/vc weight reconstruction.
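The failure mode described above can be illustrated with a toy sketch (this is not the ATOM code). Block scales only dequantize correctly when weights and scales share the same layout. Here `toy_shuffle` is a hypothetical stand-in for the real GEMM layout shuffle, and the block size of 4 is chosen for readability, whereas the model uses per_1x32.

```python
BLOCK = 4  # toy block size; the real checkpoint uses 32 elements per scale


def dequant(codes, scales, block=BLOCK):
    """Multiply each element by the scale of the block it belongs to."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


def toy_shuffle(values, block=BLOCK):
    """Hypothetical layout shuffle: reverse the order of the blocks.

    It is its own inverse, so applying it twice restores the input.
    """
    blocks = [values[i:i + block] for i in range(0, len(values), block)]
    return [v for blk in reversed(blocks) for v in blk]


codes = [1, 2, 3, 4, 5, 6, 7, 8]   # checkpoint-layout weight codes
scales = [1.0, 0.5]                # one scale per block, checkpoint layout

# Correct result: checkpoint-layout weights with checkpoint-layout scales.
reference = dequant(codes, scales)

# Keep an unshuffled copy before the layout change, as the PR does for kv_b_proj.
preserved_codes = list(codes)
shuffled_codes = toy_shuffle(codes)

# Bug pattern: shuffled weights paired with checkpoint-layout scales.
mismatched = toy_shuffle(dequant(shuffled_codes, scales))

# Fix pattern: dequantize from the preserved, unshuffled copy.
fixed = dequant(preserved_codes, scales)
```

In this toy case `mismatched` differs from `reference` because each block is multiplied by the other block's scale, while `fixed` matches it exactly.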

Technical Details

This PR adds a narrow static Quark MXFP4 path for DeepSeek V2 attention:

  • Preserve quant_config for non-Triton static Quark MXFP4 attention layers, while keeping the original behavior for other FP4 non-Triton cases.

  • Save unshuffled kv_b_proj weight and scale before LinearBase applies GEMM layout shuffling, so MLA post-load processing can dequantize using matching original weight/scale layout.

  • Update quark_post_load_weights() to handle torch.float4_e2m1fn_x2 static MXFP4 weights by decoding their packed uint8 view and using the preserved unshuffled scale when available.

  • Update SGLang MLA weight post-processing to recognize Quark MXFP4 via layer_quant_config, read preserved unshuffled kv_b_proj data only for that narrow case, and avoid applying the generic HIP/vLLM layout path to Quark MXFP4 weights.
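As a rough sketch of what decoding a packed fp4x2, per_1x32 weight involves, the pure-Python example below dequantizes one row. It is not the code in `quark_post_load_weights()`. It assumes the low nibble of each byte holds the first element, that scales are E8M0 exponents with bias 127, and that each scale covers 32 consecutive elements; the function names are hypothetical.

```python
from __future__ import annotations

# Magnitudes of the 8 non-negative E2M1 codes (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # elements per shared scale (per_1x32)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]


def dequant_mxfp4_row(packed: bytes, scales: bytes) -> list[float]:
    """Dequantize one row of packed FP4 values using per-block E8M0 scales.

    Assumption: the low nibble of each byte is the earlier element.
    """
    n_elems = len(packed) * 2
    if n_elems % BLOCK != 0 or len(scales) != n_elems // BLOCK:
        raise ValueError("expected one scale byte per 32 unpacked elements")
    if 0xFF in scales:
        raise ValueError("E8M0 code 0xFF encodes NaN and is not handled here")

    out: list[float] = []
    for i, byte in enumerate(packed):
        # Both nibbles of a byte fall in the same 32-element block.
        scale = 2.0 ** (scales[(2 * i) // BLOCK] - 127)
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

For example, 16 bytes of `0x21` with a single scale byte of 128 decode to the repeating pair 1.0, 2.0, since the codes 0.5 and 1.0 are each multiplied by 2.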

Test Plan

  1. sglang + ATOM plugin
    server:
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1

model_path=/shared/data/amd_int/models/deepseek-ai/DeepSeek-R1-0528-MXFP4-v2

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models

TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path $model_path \
    --host localhost \
    --port 9000 \
    --trust-remote-code \
    --tp-size 8 \
    --ep-size 1 \
    --mem-fraction-static 0.9 \
    --disable-radix-cache

curl:

 curl -X POST "http://localhost:9000/v1/completions" \
     -H "Content-Type: application/json" \
     -d '{
         "prompt": "The capital of China", "temperature": 0, "top_p": 1, "top_k": -1, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123
 }'
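The same request can be built from Python with only the standard library. This is a minimal sketch, assuming the server listens on port 9000 as in the launch command above; `build_completion_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str = "http://localhost:9000",  # assumption: port from the launch command
) -> urllib.request.Request:
    """Build the POST request for the /v1/completions smoke test."""
    payload = {
        "prompt": "The capital of China",
        "temperature": 0,
        "top_p": 1,
        "top_k": -1,
        "repetition_penalty": 1.0,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "stream": False,
        "ignore_eos": False,
        "n": 1,
        "seed": 123,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it against a running server, pass the result to `urllib.request.urlopen(...)` and parse the response body with `json.load`. With temperature 0 and a fixed seed, the before and after outputs should be directly comparable.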

Test Result

before:
image

after:
image

accuracy:

#!/bin/bash
set -euo pipefail
 
addr=localhost
port=9000
url=http://${addr}:${port}/v1/completions
 
model_path="/mnt/models/DeepSeek-R1-0528-MXFP4-V2"
num_concurrent="${LM_EVAL_CONCURRENT:-64}"
max_gen_toks="${LM_EVAL_MAX_GEN_TOKS:-512}"

lm_eval --model local-completions \
    --model_args "{\"base_url\": \"${url}\", \"model\": \"${model_path}\", \"num_concurrent\": ${num_concurrent}, \"max_retries\": 10, \"max_gen_toks\": ${max_gen_toks}}" \
    --tasks gsm8k \
    --batch_size auto \
    --num_fewshot 5 \
    --trust_remote_code
    # --limit 300
image
  2. ATOM
export ATOM_ENABLE_DS_QKNORM_FUSION=0
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=0
rm -rf /root/.cache/
model_path=/shared/data/amd_int/models/deepseek-ai/DeepSeek-R1-0528-MXFP4-v2
python -m atom.entrypoints.openai_server \
    --model $model_path \
    --host localhost \
    --port 9000 \
    --tensor-parallel-size 8 \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.8 \
    --no-enable_prefix_caching

before:
image

after:
image

accuracy:
image

Submission Checklist

@wuhuikx (Collaborator) commented May 27, 2026

Please measure the performance before and after this PR. How much performance benefit can we get?
BTW, we need to fuse shared_expert and routed expert together. Please show the trace to confirm they are fused.

Comment thread atom/plugin/sglang/attention_backend/sgl_attention_mla.py
Comment thread atom/model_ops/linear.py Outdated
    and self.source_quant_dtype is None
    and self.layer_quant_config.quant_method == "quark"
):
    self._mxfp4_unshuffled_weight = self.weight.detach().clone()

@ZhiweiYan-96 (Contributor) commented May 27, 2026

With the current model-loading patch, ATOM can also load the V2 model, right?

@qichu-yun (Contributor, Author) replied:

Yes, this PR also supports ATOM.

@qichu-yun closed this May 28, 2026
@qichu-yun reopened this May 28, 2026
@qichu-yun (Contributor, Author) replied:

> Please measure the performance before and after this PR. How much performance benefit can we get? BTW, we need to fuse shared_expert and routed expert together. Please show the trace to confirm they are fused.

This PR is only a basic adaptation of the model, so it does not bring much performance improvement. We will gradually optimize the model in follow-up work to reach good performance.

Here is the fused shared_expert PR: #958

@qichu-yun force-pushed the dpsk_v2_model branch 2 times, most recently from 8ffce09 to 84b3a73, May 28, 2026 12:07
@valarLip merged commit 11fa32a into main Jun 2, 2026 (24 of 31 checks passed)
@valarLip deleted the dpsk_v2_model branch June 2, 2026 07:31
@qichu-yun restored the dpsk_v2_model branch June 2, 2026 12:45

4 participants