feat: support DeepSeek-V4-Flash-Base model on gfx942 device. by junna2016 · Pull Request #996 · ROCm/ATOM

junna2016 · 2026-06-01T02:06:30Z

Motivation

Add MI308 support for the DeepSeek-V4 model with FP8 MoE.

FP8 on MI308 / gfx942 (V4-Flash-Base, FP8 per-block routed experts)

DeepSeek-V4-Flash-Base ships the same V4 architecture (mHC + CSA + HCA + sparse attn + MTP) as V4-Pro, but routed experts are FP8 e4m3 per-block 128×128 (instead of V4-Pro's FP4 e2m1 microscaling). This trades a small expert-memory increase for end-to-end ROCm gfx942 (MI308) compatibility — aiter's FP8 grouped GEMM has been tuned for gfx942, while the FP4 path was authored for gfx950 (MI355X).

python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-V4-Flash-Base \
  --kv_cache_dtype fp8 -tp 8

The routed-expert quant scheme is auto-detected from the HF quantization_config dict:

Field	V4-Pro (FP4)	V4-Flash-Base (FP8)
`quant_method`	`quark` (with FP4 layer pattern)	`fp8`
`fmt`	`e2m1`	`e4m3`
`weight_block_size`	(per_1x32, microscaling)	`[128, 128]`
`scale_fmt`	`ue8m0`	`ue8m0`

Override knobs (escape hatches, normally not needed):

ATOM_V4_ROUTED_QUANT={fp4,fp8_block} — forces the routed-expert path. Useful for debugging or when the auto-detection picks the wrong scheme. fp8 and fp8_per_block are valid aliases for fp8_block.
ATOM_V4_DISABLE_FUSED_SHARED=1 — disables the aiter fused shared+routed expert kernel. On V4-Flash-Base both routed and shared experts are FP8 (matching dtype), so the framework auto-enables fusion. If you hit numerical instabilities or kernel issues on a specific GPU, set this to 1 to keep them as 2 separate kernels.
ATOM_USE_TRITON_MOE=1 — gfx942 defaults to Triton MoE automatically (no need to set), but it doesn't hurt to set explicitly. Required on gfx950 for V4-Pro (see V4-Pro section above).
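
The alias handling described for ATOM_V4_ROUTED_QUANT can be sketched as below. This is a minimal illustration and not the code in this PR: the helper name normalize_routed_quant is hypothetical, and raising on an unrecognized value is an assumption about how bad input might be handled.

```python
# Accepted spellings, per the list above: fp8 and fp8_per_block alias fp8_block.
_ALIASES = {
    "fp4": "fp4",
    "fp8_block": "fp8_block",
    "fp8": "fp8_block",
    "fp8_per_block": "fp8_block",
}


def normalize_routed_quant(raw):
    """Map a raw ATOM_V4_ROUTED_QUANT value to a canonical scheme name.

    Returns None when the variable is unset or empty, which here means
    "no override, fall through to auto-detection".
    Hypothetical helper, for illustration only.
    """
    value = (raw or "").strip().lower()
    if not value:
        return None
    if value not in _ALIASES:
        # Assumption: fail loudly rather than silently ignore a typo.
        raise ValueError(f"unsupported ATOM_V4_ROUTED_QUANT value: {raw!r}")
    return _ALIASES[value]
```

In a server process this would typically be called as normalize_routed_quant(os.environ.get("ATOM_V4_ROUTED_QUANT")), so an unset variable leaves auto-detection in charge.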

Auto-detection logic

The routed-expert quant spec is resolved in this priority order (see _detect_v4_routed_quant_spec):

ATOM_V4_ROUTED_QUANT env override — explicit forcing.
Parser-derived layer spec — if the ckpt's quantization_config.layer_quant_config (Quark) or global config (compressed-tensors / generic) directly produces a per-layer spec for ffn.experts.*.w*, that wins.
Heuristic from quant_method / fmt — strings containing fp8 → FP8 block; fp4 / mxfp4 → FP4.
V4-Pro fallback — historical default.

For V4-Flash-Base's HF quantization_config = {"quant_method": "fp8", "fmt": "e4m3", "weight_block_size": [128, 128], "scale_fmt": "ue8m0"}, the GenericParser (regex block|1x128) extracts (per_1x128, fp8) global spec, and step 2 hits → routed expert spec is (QuantType.per_1x128, dtypes.fp8). dtypes.fp8 from aiter resolves to float8_e4m3fnuz on gfx942 and float8_e4m3fn on gfx950 — picked correctly per platform without code changes.
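
The four-step priority order can be sketched roughly as follows. This is not the body of _detect_v4_routed_quant_spec: the function name resolve_routed_quant, its arguments, and the string tuples standing in for aiter's QuantType / dtypes values are all assumptions made for illustration. Unknown override values simply fall through to the next step here, which may differ from the real behavior.

```python
from typing import Mapping, Optional, Tuple

# (quant type, dtype) as plain strings; stand-ins for aiter's QuantType / dtypes.
Spec = Tuple[str, str]

FP4_SPEC: Spec = ("per_1x32", "fp4")
FP8_BLOCK_SPEC: Spec = ("per_1x128", "fp8")


def resolve_routed_quant(
    env_value: Optional[str],
    parsed_layer_spec: Optional[Spec],
    quant_config: Mapping[str, object],
) -> Spec:
    """Hypothetical resolver mirroring the priority list above."""
    # 1. Explicit env override wins over everything else.
    forced = (env_value or "").strip().lower()
    if forced == "fp4":
        return FP4_SPEC
    if forced in ("fp8_block", "fp8", "fp8_per_block"):
        return FP8_BLOCK_SPEC

    # 2. A spec produced by the quantization-config parser, if any.
    if parsed_layer_spec is not None:
        return parsed_layer_spec

    # 3. String heuristic on quant_method / fmt.
    hint = f"{quant_config.get('quant_method', '')} {quant_config.get('fmt', '')}".lower()
    if "fp8" in hint:
        return FP8_BLOCK_SPEC
    if "fp4" in hint:  # also matches "mxfp4"
        return FP4_SPEC

    # 4. Historical default: the V4-Pro FP4 path.
    return FP4_SPEC


flash_cfg = {
    "quant_method": "fp8",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
    "scale_fmt": "ue8m0",
}
# Step 2 hits when the parser already produced a spec.
print(resolve_routed_quant(None, ("per_1x128", "fp8"), flash_cfg))  # ('per_1x128', 'fp8')
```

With this ordering, the V4-Flash-Base config still resolves to the FP8 block spec through step 3 even if the parser returns nothing, so steps 2 and 3 act as independent safety nets.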

MI308 specifics

KV pool slot sizes are identical to V4-Pro (584 B per token, FP8 NoPE 448 B + BF16 RoPE 128 B + 8 B UE8M0 scales).
The CSA indexer's K cache stays FP8 (132 B / token) regardless of routed-expert dtype.
Compressor / Indexer Triton kernels (fused_compress_attn, sparse_attn_v4_paged_decode) are SKU-agnostic.
Three-stream concurrency (main / alt / compress) works identically.
TP / EP sharding follows V4-Pro layout — n_routed_experts=256, top-k=6 matches the standard FusedMoE expert-shard math.
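
As a quick sanity check on the numbers in this list, the arithmetic can be written out. The helper names are hypothetical, and the even split of experts across ranks assumes an expert-parallel size of 8, as in the ep=tp=8 run reported in this thread.

```python
def kv_slot_bytes(nope_fp8=448, rope_bf16=128, ue8m0_scales=8):
    """Sum the per-token KV pool slot components quoted above (in bytes)."""
    return nope_fp8 + rope_bf16 + ue8m0_scales


def experts_per_rank(n_routed_experts, ep_size):
    """Number of routed experts each rank holds under an even EP split."""
    if n_routed_experts % ep_size != 0:
        raise ValueError("experts do not divide evenly across ranks")
    return n_routed_experts // ep_size


print(kv_slot_bytes())           # 584
print(experts_per_rank(256, 8))  # 32
```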

yitingw1 · 2026-06-01T06:50:18Z

Locally tested DeepSeek-V4-Flash-Base on 8x MI308 using the server script above.
Here are the accuracy results:

scripts

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V4-Flash-Base,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 5

results:

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8613	±	0.0095
		strict-match	5	exact_match	↑	0.8613	±	0.0095

junna2016 · 2026-06-01T09:37:37Z

launch server:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  PYTHONPATH=/workspace/aiter:/workspace/ATOM \
  AITER_USE_SYSTEM_TRITON=1 \
  AITER_LOG_LEVEL=WARNING \
  ATOM_FORCE_ATTN_TRITON=1 \
  python -m atom.entrypoints.openai_server \
    --model /models/DeepSeek-V4-Flash-Base \
    --kv_cache_dtype fp8 \
    --server-port 9677 \
    --max-model-len 16384 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.95 \
    -tp 8

gsm8k test results:

script:

lm_eval --model local-completions \
  --model_args model=/models/DeepSeek-V4-Flash-Base,base_url=http://localhost:9677/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 5

results:

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8560	±	0.0097
		strict-match	5	exact_match	↑	0.8552	±	0.0097

with mtp=3

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.862	±	0.0095
		strict-match	5	exact_match	↑	0.862	±	0.0095

with ep=tp=8 mtp=3

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8666	±	0.0094
		strict-match	5	exact_match	↑	0.8673	±	0.0093

valarLip · 2026-06-01T12:11:30Z

Thanks @junna2016 @yitingw1, waiting for CI.

valarLip · 2026-06-01T12:12:45Z

emmm could you please fix this one ... https://github.com/ROCm/ATOM/actions/runs/26747521301/job/78849193273?pr=996

junna2016 · 2026-06-01T12:45:35Z

https://github.com/ROCm/ATOM/actions/runs/26747521301/job/78849193273?pr=996

Sure, I will format the code with the right style. Thanks for the reminder.

yitingw1 · 2026-06-02T07:07:26Z

+    """
+
+    # ── 1. Explicit env override ──
+    forced = os.environ.get("ATOM_V4_ROUTED_QUANT", "").strip().lower()


Here we can use hf_config instead of introducing a new environment variable, "ATOM_V4_ROUTED_QUANT".
For example:

expert_dtype = getattr(hf_config, "expert_dtype", None) or ""
if isinstance(expert_dtype, str):
    ed = expert_dtype.lower()
    if "fp4" in ed:
        return fp4_spec
    if "fp8" in ed:
        return fp8_block_spec
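
For trying the idea outside the codebase, the suggestion above can be wrapped into a self-contained function. The attribute name expert_dtype is the one proposed in this comment; the function name spec_from_hf_config, the placeholder spec tuples, and the SimpleNamespace stand-in for the HF config object are assumptions for illustration only.

```python
from types import SimpleNamespace

# Placeholder specs; the real code would return aiter QuantType / dtypes pairs.
FP4_SPEC = ("per_1x32", "fp4")
FP8_BLOCK_SPEC = ("per_1x128", "fp8")


def spec_from_hf_config(hf_config):
    """Return a routed-expert spec from hf_config.expert_dtype, or None.

    None means the attribute is missing or unrecognized, so the caller
    should continue with its other detection steps.
    """
    expert_dtype = getattr(hf_config, "expert_dtype", None)
    if isinstance(expert_dtype, str):
        ed = expert_dtype.lower()
        if "fp4" in ed:
            return FP4_SPEC
        if "fp8" in ed:
            return FP8_BLOCK_SPEC
    return None


# Stand-in config objects; a real run would pass the loaded HF config.
print(spec_from_hf_config(SimpleNamespace(expert_dtype="FP8")))  # ('per_1x128', 'fp8')
print(spec_from_hf_config(SimpleNamespace()))                    # None
```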