feat: support DeepSeek-V4-Flash-Base model on gfx942 device. #996
Open
junna2016 wants to merge 1 commit into
yitingw1
reviewed
Jun 1, 2026
|
Locally tested DeepSeek-V4-Flash-Base on 8x MI308 using the above server scripts.
lm_eval --model local-completions \
--model_args model=deepseek-ai/DeepSeek-V4-Flash-Base,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
--tasks gsm8k --num_fewshot 5
|
Author
|
Launch server:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
PYTHONPATH=/workspace/aiter:/workspace/ATOM \
AITER_USE_SYSTEM_TRITON=1 \
AITER_LOG_LEVEL=WARNING \
ATOM_FORCE_ATTN_TRITON=1 \
python -m atom.entrypoints.openai_server \
--model /models/DeepSeek-V4-Flash-Base \
--kv_cache_dtype fp8 \
--server-port 9677 \
--max-model-len 16384 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95 \
-tp 8

gsm8k test results:
lm_eval --model local-completions \
--model_args model=/models/DeepSeek-V4-Flash-Base,base_url=http://localhost:9677/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
--tasks gsm8k --num_fewshot 5
|
4e071d1 to d258279
Collaborator
|
Thanks @junna2016 @yitingw1, waiting for CI.
Collaborator
|
Hmm, could you please fix this one: https://github.com/ROCm/ATOM/actions/runs/26747521301/job/78849193273?pr=996
Author
Sure, I will format the code with the right style. Thanks for the reminder.
d258279 to f7ab793
f7ab793 to 16814ff
yitingw1
reviewed
Jun 2, 2026
"""

# ── 1. Explicit env override ──
forced = os.environ.get("ATOM_V4_ROUTED_QUANT", "").strip().lower()
Here we can use hf_config instead of introducing a new env var, ATOM_V4_ROUTED_QUANT.
For example:
expert_dtype = getattr(hf_config, "expert_dtype", None) or ""
if isinstance(expert_dtype, str):
    ed = expert_dtype.lower()
    if "fp4" in ed:
        return fp4_spec
    if "fp8" in ed:
        return fp8_block_spec
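To make the suggestion above easier to try in isolation, here is a minimal runnable sketch. It assumes that `hf_config` exposes an `expert_dtype` string attribute, as the review comment proposes; the function name `spec_from_hf_config` and the two spec constants are hypothetical placeholders, not ATOM's actual objects.

```python
from types import SimpleNamespace

# Hypothetical placeholders for the real quant spec objects.
FP4_SPEC = "fp4"
FP8_BLOCK_SPEC = "fp8_block"


def spec_from_hf_config(hf_config):
    """Pick a routed-expert spec from hf_config.expert_dtype, or None if unknown."""
    # Assumption: expert_dtype is a string such as "fp8" or "fp4" when present.
    expert_dtype = getattr(hf_config, "expert_dtype", None) or ""
    if isinstance(expert_dtype, str):
        ed = expert_dtype.lower()
        if "fp4" in ed:
            return FP4_SPEC
        if "fp8" in ed:
            return FP8_BLOCK_SPEC
    # No usable hint in the config: let the caller fall back to other detection.
    return None


# Stand-in config object, used here only for illustration.
flash_base_like = SimpleNamespace(expert_dtype="fp8")
print(spec_from_hf_config(flash_base_like))
```

Returning `None` when the attribute is missing keeps this helper composable with whatever fallback detection the caller already has.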
Motivation
Support the DeepSeek-V4 model with FP8 MoE on MI308.
FP8 on MI308 / gfx942 (V4-Flash-Base, FP8 per-block routed experts)
DeepSeek-V4-Flash-Base ships the same V4 architecture (mHC + CSA + HCA + sparse attn + MTP) as V4-Pro, but its routed experts are FP8 e4m3 per-block 128×128 (instead of V4-Pro's FP4 e2m1 microscaling). This trades a small expert-memory increase for end-to-end ROCm `gfx942` (MI308) compatibility: `aiter`'s FP8 grouped GEMM has been tuned for `gfx942`, while the FP4 path was authored for `gfx950` (MI355X).

The routed-expert quant scheme is auto-detected from the HF `quantization_config` dict:

| Field | V4-Pro | V4-Flash-Base |
|---|---|---|
| `quant_method` | `quark` (with FP4 layer pattern) | `fp8` |
| `fmt` | `e2m1` | `e4m3` |
| `weight_block_size` | | `[128, 128]` |
| `scale_fmt` | `ue8m0` | `ue8m0` |

Override knobs (escape hatches, normally not needed):

- `ATOM_V4_ROUTED_QUANT={fp4,fp8_block}`: forces the routed-expert path. Useful for debugging or when the auto-detection picks the wrong scheme. `fp8` and `fp8_per_block` are valid aliases for `fp8_block`.
- `ATOM_V4_DISABLE_FUSED_SHARED=1`: disables the aiter fused shared+routed expert kernel. On V4-Flash-Base both routed and shared experts are FP8 (matching dtype), so the framework auto-enables fusion. If you hit numerical instabilities or kernel issues on a specific GPU, set this to 1 to keep them as 2 separate kernels.
- `ATOM_USE_TRITON_MOE=1`: `gfx942` defaults to Triton MoE automatically, so setting it is unnecessary but harmless. Required on `gfx950` for V4-Pro (see V4-Pro section above).

Auto-detection logic
The routed-expert quant spec is resolved in this priority order (see `_detect_v4_routed_quant_spec`):

1. `ATOM_V4_ROUTED_QUANT` env override: explicit forcing.
2. If `quantization_config.layer_quant_config` (Quark) or the global config (compressed-tensors / generic) directly produces a per-layer spec for `ffn.experts.*.w*`, that wins.
3. `quant_method` / `fmt`: strings containing `fp8` → FP8 block; `fp4` / `mxfp4` → FP4.

For V4-Flash-Base's HF `quantization_config = {"quant_method": "fp8", "fmt": "e4m3", "weight_block_size": [128, 128], "scale_fmt": "ue8m0"}`, the GenericParser (regex `block|1x128`) extracts a `(per_1x128, fp8)` global spec, and step 2 hits, so the routed expert spec is `(QuantType.per_1x128, dtypes.fp8)`. `dtypes.fp8` from `aiter` resolves to `float8_e4m3fnuz` on `gfx942` and `float8_e4m3fn` on `gfx950`, so it is picked correctly per platform without code changes.

MI308 specifics
- `fused_compress_attn`, `sparse_attn_v4_paged_decode`) are SKU-agnostic.
- `n_routed_experts=256, top-k=6` matches the standard FusedMoE expert-shard math.
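The priority order described under "Auto-detection logic" can be sketched roughly as follows. This is an illustration under stated assumptions, not the body of `_detect_v4_routed_quant_spec`: the function name `detect_routed_quant`, the string spec constants, the `per_layer_spec` argument (standing in for whatever the Quark / generic parser produced), and the choice to raise on an unknown override value are all hypothetical.

```python
import os

# Hypothetical stand-ins for the real spec objects.
FP4_SPEC = "fp4"
FP8_BLOCK_SPEC = "fp8_block"

# Override values and aliases as listed in the PR description.
_ENV_ALIASES = {
    "fp4": FP4_SPEC,
    "fp8_block": FP8_BLOCK_SPEC,
    "fp8": FP8_BLOCK_SPEC,
    "fp8_per_block": FP8_BLOCK_SPEC,
}


def detect_routed_quant(quant_config, per_layer_spec=None, env=None):
    """Resolve the routed-expert quant scheme in the described priority order."""
    env = os.environ if env is None else env

    # 1. Explicit env override.
    forced = env.get("ATOM_V4_ROUTED_QUANT", "").strip().lower()
    if forced:
        if forced not in _ENV_ALIASES:
            # Assumption: reject unknown values rather than silently ignoring them.
            raise ValueError(f"unknown ATOM_V4_ROUTED_QUANT value: {forced!r}")
        return _ENV_ALIASES[forced]

    # 2. A per-layer spec already produced by the config parser wins.
    if per_layer_spec is not None:
        return per_layer_spec

    # 3. Fall back to string matching on quant_method / fmt.
    text = " ".join(
        str(quant_config.get(key, "")) for key in ("quant_method", "fmt")
    ).lower()
    if "fp8" in text:
        return FP8_BLOCK_SPEC
    if "fp4" in text:  # also matches "mxfp4"
        return FP4_SPEC
    return None


flash_base_cfg = {
    "quant_method": "fp8",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
    "scale_fmt": "ue8m0",
}
print(detect_routed_quant(flash_base_cfg, env={}))
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment; in this simplified form a Quark-style config with no parser output falls through to `None`, whereas the description says step 2 handles that case.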