[Feat] enable fuse share expert in DeepSeek-R1-0528-MXFP4-v2 in sgl atom#958

Open
ZLkanyo009 wants to merge 1 commit into main from lingzha/fuse-share-expert

@ZLkanyo009 ZLkanyo009 commented May 28, 2026

Motivation

In DeepSeek-R1-0528-MXFP4-v2, the shared experts and routed experts in layer 61 are of different types, while in all other layers they are of the same type. Because of this single mismatch, the original is_rocm_aiter_fusion_shared_expert_enabled() check causes every layer to skip shared expert fusion. This PR enables shared expert fusion for the compatible layers of DeepSeek-R1-0528-MXFP4-v2. For other models, the original behavior is preserved to ensure compatibility.
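The difference between an all-or-nothing check and a per-layer check can be illustrated with a small, self-contained sketch. This is not the code from this PR: `LayerExperts`, `can_fuse_layer`, `can_fuse_globally`, and the type strings `"mxfp4"` and `"bf16"` are hypothetical names chosen for illustration, and the all-or-nothing function is only a simplified stand-in for the behavior described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerExperts:
    """Hypothetical per-layer description of expert types (illustration only)."""
    layer_id: int
    routed_type: str  # placeholder type label for the routed experts
    shared_type: str  # placeholder type label for the shared experts


def can_fuse_layer(layer: LayerExperts) -> bool:
    # Per-layer decision: fuse only when shared and routed experts match.
    return layer.routed_type == layer.shared_type


def can_fuse_globally(layers: list[LayerExperts]) -> bool:
    # All-or-nothing decision: a single mismatching layer disables fusion
    # for every layer (simplified stand-in for the original behavior).
    return all(can_fuse_layer(layer) for layer in layers)


# Toy model: layers 59 and 60 are compatible, layer 61 is not.
# The concrete type labels are assumptions, not taken from the checkpoint.
layers = [LayerExperts(i, "mxfp4", "mxfp4") for i in (59, 60)]
layers.append(LayerExperts(61, "mxfp4", "bf16"))

global_decision = can_fuse_globally(layers)
per_layer_decision = {layer.layer_id: can_fuse_layer(layer) for layer in layers}

print("all-or-nothing:", global_decision)
print("per-layer:", per_layer_decision)
```

With a per-layer decision, only the mismatching layer falls back to the unfused path, while the remaining layers keep the fused shared expert.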

Command

export CUDA_VISIBLE_DEVICES=0,1,2,3

export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1

model_path=/models/deepseek-ai/DeepSeek-R1-0528-MXFP4-v2

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models
 
TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path $model_path \
    --host localhost \
    --port 8000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --mem-fraction-static 0.9 \
    --disable-radix-cache \
    --attention-backend aiter \
    --kv-cache-dtype fp8_e4m3 \
    --max-running-requests 128

export CUDA_VISIBLE_DEVICES=0,1,2,3

export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1

export MORI_SHMEM_MODE=ISOLATION

model_path=/models/deepseek-ai/DeepSeek-R1-0528-MXFP4-v2

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models
 
TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path $model_path \
    --host localhost \
    --port 8000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --expert-parallel-size 4 \
    --enable-dp-attention \
    --mem-fraction-static 0.9 \
    --disable-radix-cache \
    --attention-backend aiter \
    --kv-cache-dtype fp8_e4m3 \
    --max-running-requests 128

Performance

before (screenshot not preserved):
The shared expert is not fused into the MoE.

after (screenshot not preserved):
The shared expert is fused into the MoE.

@ZLkanyo009 ZLkanyo009 force-pushed the lingzha/fuse-share-expert branch from da41796 to 94417a5 on May 28, 2026 07:51