
feat: support DeepSeek-V4-Flash-Base model on gfx942 device. #996

Open
junna2016 wants to merge 1 commit into ROCm:main from junna2016:xjn_308_dsv4_fp8


@junna2016

Motivation

Add MI308 support for the DeepSeek-V4 model with FP8 MoE.

FP8 on MI308 / gfx942 (V4-Flash-Base, FP8 per-block routed experts)

DeepSeek-V4-Flash-Base ships the same V4 architecture (mHC + CSA + HCA + sparse attn + MTP) as V4-Pro, but routed experts are FP8 e4m3 per-block 128×128 (instead of V4-Pro's FP4 e2m1 microscaling). This trades a small expert-memory increase for end-to-end ROCm gfx942 (MI308) compatibility — aiter's FP8 grouped GEMM has been tuned for gfx942, while the FP4 path was authored for gfx950 (MI355X).

python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-V4-Flash-Base \
  --kv_cache_dtype fp8 -tp 8
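
Once the server is up, it can be queried through its OpenAI-style completions route. The sketch below only builds the HTTP request; it assumes the default address `http://localhost:8000/v1/completions` used by the lm_eval run later in this thread, and the helper name `build_completion_request` is hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    model="deepseek-ai/DeepSeek-V4-Flash-Base",
    base_url="http://localhost:8000/v1/completions",  # assumed default address
    max_tokens=64,
):
    """Build (but do not send) a POST request for a text completion."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; V4-Flash-Base is a base model, so the plain completions route is used here rather than a chat route.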

The routed-expert quant scheme is auto-detected from the HF quantization_config dict:

| Field | V4-Pro (FP4) | V4-Flash-Base (FP8) |
|---|---|---|
| quant_method | quark (with FP4 layer pattern) | fp8 |
| fmt | e2m1 | e4m3 |
| weight_block_size | (per_1x32, microscaling) | [128, 128] |
| scale_fmt | ue8m0 | ue8m0 |

Override knobs (escape hatches, normally not needed):

  • ATOM_V4_ROUTED_QUANT={fp4,fp8_block}: forces the routed-expert path. Useful for debugging or when auto-detection picks the wrong scheme. fp8 and fp8_per_block are accepted as aliases for fp8_block.
  • ATOM_V4_DISABLE_FUSED_SHARED=1: disables the aiter fused shared+routed expert kernel. On V4-Flash-Base both routed and shared experts are FP8 (matching dtype), so the framework enables fusion automatically. If you hit numerical instabilities or kernel issues on a specific GPU, set this to 1 to run them as two separate kernels.
  • ATOM_USE_TRITON_MOE=1: gfx942 defaults to Triton MoE automatically, so setting it is unnecessary but harmless. Required on gfx950 for V4-Pro (see V4-Pro section above).
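
As an illustration of the alias handling described for ATOM_V4_ROUTED_QUANT, here is a minimal sketch. The function name `read_routed_quant_override` is hypothetical, and raising on unknown values is an assumption; the PR may handle them differently.

```python
import os

# Accepted spellings, as listed in the PR description.
_ROUTED_QUANT_ALIASES = {
    "fp4": "fp4",
    "fp8_block": "fp8_block",
    "fp8": "fp8_block",
    "fp8_per_block": "fp8_block",
}


def read_routed_quant_override(environ=None):
    """Return 'fp4', 'fp8_block', or None when the override is unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("ATOM_V4_ROUTED_QUANT", "").strip().lower()
    if not raw:
        return None
    if raw not in _ROUTED_QUANT_ALIASES:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported ATOM_V4_ROUTED_QUANT value: {raw!r}")
    return _ROUTED_QUANT_ALIASES[raw]
```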

Auto-detection logic

The routed-expert quant spec is resolved in this priority order (see _detect_v4_routed_quant_spec):

  1. ATOM_V4_ROUTED_QUANT env override — explicit forcing.
  2. Parser-derived layer spec — if the ckpt's quantization_config.layer_quant_config (Quark) or global config (compressed-tensors / generic) directly produces a per-layer spec for ffn.experts.*.w*, that wins.
  3. Heuristic from quant_method / fmt — strings containing fp8 → FP8 block; fp4 / mxfp4 → FP4.
  4. V4-Pro fallback — historical default.
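
The priority order can be sketched as a small function. This is not the body of `_detect_v4_routed_quant_spec`; the name `resolve_routed_quant`, its arguments, and the string labels standing in for real spec objects are all assumptions made for illustration.

```python
def resolve_routed_quant(env_value, parser_spec, quant_config):
    """Return (spec, source) following the four-step priority order."""
    # 1. Explicit env override wins over everything else.
    if env_value:
        return env_value, "env"
    # 2. A per-layer spec produced by the checkpoint parser, if any.
    if parser_spec is not None:
        return parser_spec, "parser"
    # 3. Substring heuristic over quant_method and fmt.
    text = "{} {}".format(
        quant_config.get("quant_method") or "", quant_config.get("fmt") or ""
    ).lower()
    if "fp8" in text:
        return "fp8_block", "heuristic"
    if "fp4" in text:  # also matches "mxfp4"
        return "fp4", "heuristic"
    # 4. Historical V4-Pro default (FP4).
    return "fp4", "fallback"
```

With this ordering, a V4-Pro-style config (`quant_method="quark"`, `fmt="e2m1"`) that yields no parser spec falls through to step 4, while `quant_method="fp8"` is caught by the heuristic even when the parser returns nothing.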

For V4-Flash-Base's HF quantization_config = {"quant_method": "fp8", "fmt": "e4m3", "weight_block_size": [128, 128], "scale_fmt": "ue8m0"}, the GenericParser (regex block|1x128) extracts a global (per_1x128, fp8) spec, so step 2 matches and the routed-expert spec is (QuantType.per_1x128, dtypes.fp8). aiter's dtypes.fp8 resolves to float8_e4m3fnuz on gfx942 and float8_e4m3fn on gfx950, so the correct type is picked per platform without code changes.
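
To make the per-block layout concrete, the sketch below counts the 128×128 scale blocks of a weight and derives a power-of-two (ue8m0-style) scale for one block. The helper names and the example weight shape are hypothetical, and rounding the scale up to the next power of two is an assumption about the quantizer, not something the PR states. The finite maxima (240 for float8_e4m3fnuz, 448 for float8_e4m3fn) are properties of the formats themselves.

```python
import math

# Largest finite value of each FP8 e4m3 variant.
FP8_MAX = {"float8_e4m3fnuz": 240.0, "float8_e4m3fn": 448.0}
# Arch-to-dtype mapping as stated in the PR description.
FP8_DTYPE_BY_ARCH = {"gfx942": "float8_e4m3fnuz", "gfx950": "float8_e4m3fn"}


def scale_grid(rows, cols, block=(128, 128)):
    """Number of per-block scales along each weight dimension."""
    return math.ceil(rows / block[0]), math.ceil(cols / block[1])


def ue8m0_scale(block_amax, fp8_max):
    """Smallest power-of-two scale s such that block_amax / s <= fp8_max."""
    if block_amax == 0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(block_amax / fp8_max))


# Hypothetical 2048 x 7168 expert weight on gfx942.
grid = scale_grid(2048, 7168)
scale = ue8m0_scale(100.0, FP8_MAX[FP8_DTYPE_BY_ARCH["gfx942"]])
```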

MI308 specifics

  • KV pool slot sizes are identical to V4-Pro (584 B per token, FP8 NoPE 448 B + BF16 RoPE 128 B + 8 B UE8M0 scales).
  • The CSA indexer's K cache stays FP8 (132 B / token) regardless of routed-expert dtype.
  • Compressor / Indexer Triton kernels (fused_compress_attn, sparse_attn_v4_paged_decode) are SKU-agnostic.
  • Three-stream concurrency (main / alt / compress) works identically.
  • TP / EP sharding follows V4-Pro layout — n_routed_experts=256, top-k=6 matches the standard FusedMoE expert-shard math.
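
The slot-size figures above add up as follows. The byte counts come from the list; the pool-size helper and the 16384-slot example are illustrative only, since the PR does not state how many slots a given sequence length occupies after compression.

```python
# Per-token byte counts quoted in the PR description.
NOPE_FP8_BYTES = 448
ROPE_BF16_BYTES = 128
SCALE_UE8M0_BYTES = 8
INDEXER_K_BYTES = 132


def kv_slot_bytes():
    """Bytes per KV pool slot: FP8 NoPE + BF16 RoPE + UE8M0 scales."""
    return NOPE_FP8_BYTES + ROPE_BF16_BYTES + SCALE_UE8M0_BYTES


def pool_bytes(num_slots, per_slot_bytes):
    """Raw storage for num_slots slots of a single cache."""
    return num_slots * per_slot_bytes


kv_total = pool_bytes(16384, kv_slot_bytes())     # hypothetical slot count
indexer_total = pool_bytes(16384, INDEXER_K_BYTES)
```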

@yitingw1

yitingw1 commented Jun 1, 2026

Tested DeepSeek-V4-Flash-Base locally on 8×MI308 using the server script above.
Here are the accuracy results:

  • script:
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V4-Flash-Base,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 5 
  • results:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8613 | ± 0.0095 |
| | | strict-match | 5 | exact_match | 0.8613 | ± 0.0095 |

@junna2016
Author

junna2016 commented Jun 1, 2026

Launch the server:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
PYTHONPATH=/workspace/aiter:/workspace/ATOM \
AITER_USE_SYSTEM_TRITON=1 \
AITER_LOG_LEVEL=WARNING \
ATOM_FORCE_ATTN_TRITON=1 \
python -m atom.entrypoints.openai_server \
  --model /models/DeepSeek-V4-Flash-Base \
  --kv_cache_dtype fp8 \
  --server-port 9677 \
  --max-model-len 16384 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.95 \
  -tp 8

gsm8k test results:

  • script:
lm_eval --model local-completions \
  --model_args model=/models/DeepSeek-V4-Flash-Base,base_url=http://localhost:9677/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 5
  • results:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8560 | ± 0.0097 |
| | | strict-match | 5 | exact_match | 0.8552 | ± 0.0097 |

  • with mtp=3

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.862 | ± 0.0095 |
| | | strict-match | 5 | exact_match | 0.862 | ± 0.0095 |

  • with ep=tp=8 mtp=3

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8666 | ± 0.0094 |
| | | strict-match | 5 | exact_match | 0.8673 | ± 0.0093 |
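
One rough way to read these numbers is to compare two configurations against their reported standard errors. The sketch below treats the runs as independent, which is an approximation since they share the same gsm8k test set; the helper name `z_score` is hypothetical.

```python
import math


def z_score(value_a, stderr_a, value_b, stderr_b):
    """Difference of two scores in units of their combined standard error."""
    return (value_a - value_b) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)


# flexible-extract: ep=tp=8 with mtp=3 versus the plain tp=8 run.
z = z_score(0.8666, 0.0094, 0.8560, 0.0097)
```

Under that approximation the gap is less than one combined standard error, so the three configurations are not distinguishable from these runs alone.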

@valarLip
Collaborator

valarLip commented Jun 1, 2026

Thanks @junna2016 @yitingw1, waiting for CI.

@valarLip
Collaborator

valarLip commented Jun 1, 2026

emmm could you please fix this one ... https://github.com/ROCm/ATOM/actions/runs/26747521301/job/78849193273?pr=996

@junna2016
Author

https://github.com/ROCm/ATOM/actions/runs/26747521301/job/78849193273?pr=996

Sure, I will format the code with the right style. Thanks for the reminder.

"""

# ── 1. Explicit env override ──
forced = os.environ.get("ATOM_V4_ROUTED_QUANT", "").strip().lower()

We can use hf_config here instead of introducing a new environment variable, ATOM_V4_ROUTED_QUANT.
For example:

expert_dtype = getattr(hf_config, "expert_dtype", None) or ""
if isinstance(expert_dtype, str):
    ed = expert_dtype.lower()
    if "fp4" in ed:
        return fp4_spec
    if "fp8" in ed:
        return fp8_block_spec
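
The suggestion can be wrapped into a self-contained function for experimentation. In this sketch the string constants stand in for the real spec objects, `SimpleNamespace` stands in for the HF config, and the function name is hypothetical; whether the checkpoint config actually carries an `expert_dtype` field is the reviewer's premise and is not verified here.

```python
from types import SimpleNamespace

# Placeholders for the real (QuantType, dtype) spec objects.
FP4_SPEC = "fp4"
FP8_BLOCK_SPEC = "fp8_block"


def spec_from_expert_dtype(hf_config):
    """Return a routed-expert spec from hf_config.expert_dtype, or None."""
    expert_dtype = getattr(hf_config, "expert_dtype", None) or ""
    if isinstance(expert_dtype, str):
        ed = expert_dtype.lower()
        if "fp4" in ed:
            return FP4_SPEC
        if "fp8" in ed:
            return FP8_BLOCK_SPEC
    # Field missing, empty, or not a string: let the caller fall back.
    return None


flash_like = SimpleNamespace(expert_dtype="FP8")  # stand-in config object
```

Returning None when the field is absent lets the caller continue with the remaining detection steps.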
