[Feature] Support MiniMax-M2.5 FP8 MoE inference on SM80 (A100/A800) by ZhijunLStudio · Pull Request #7723 · PaddlePaddle/FastDeploy

ZhijunLStudio · 2026-05-06T04:52:23Z

Motivation

MiniMax-M2.5 uses FP8 quantized MoE weights. On SM90+ (H100/B200), FP8 tensor cores can execute directly. However, SM80 devices (A100/A800) lack FP8 hardware support, requiring dequantization of FP8 weights to BF16 before standard GEMM computation.

This PR adds full FP8 MoE inference support for SM80 devices, enabling MiniMax-M2.5 to run on A100/A800 GPUs.

Platform behavior: All SM80-specific paths are gated behind get_sm_version() < 90. SM90+ (H100/B200) uses the existing FP8 native path and is unaffected. SM89 (Ada/RTX 4090) also falls into the < 90 branch — it has FP8 tensor cores but this codebase does not yet distinguish SM89 from SM80 for FP8 MoE; Ada users can run on the standard path by leaving FD_MARLIN_FP8 unset.

Modifications

Core Changes

fastdeploy/model_executor/models/minimax_m2_5.py (new file)

MiniMax-M2.5 model implementation with SM80 FP8->BF16 dequant weight loading
_load_fp8_marlin_layer: dequantizes FP8 weights to BF16 on SM80, stored in layer._sm80_gate/up/down
_dequant_fp8_weights: batch FP8 dequant with numpy-based block-wise scale expansion
All SM80 logic gated behind get_sm_version() < 90

fastdeploy/model_executor/layers/moe/fused_moe_marlin_backend.py (modified)

Added weight_type detection (fp8 vs int4) to support FP8 Marlin path
create_weights: skips Marlin packing on SM80 FP8, creates dummy placeholder parameters
apply / apply_ep_noalltoall: routes to _apply_ep_sm80_bf16 on SM80
Added _apply_ep_sm80_bf16: uses one-hot + matmul for expert weight selection, paddle.bmm for FFN
CUDA graph compatible: no numpy() calls, no data-dependent Python loops, no gather_nd

fastdeploy/model_executor/layers/quantization/block_wise_fp8.py (modified)

BlockWiseFP8Config.get_quant_method(): routes to MarlinWeightOnlyMoEMethod on SM80 with FD_MARLIN_FP8=1
BlockWiseFP8LinearMethod.apply(): SM80 BF16 dequant fallback
- Uses paddle.repeat_interleave instead of expand+reshape to avoid PaddlePaddle memory-layout bug
- Falls back to raw FP8->BF16 cast when block-wise scale tensor is uninitialized

fastdeploy/model_executor/layers/linear.py (modified)

SM80 list unpacking for attention output (append_attention returns list on SM80)

fastdeploy/distributed/communication.py (modified)

capture_custom_allreduce(): added hasattr(_TP_AR, "_ptr") safety check

fastdeploy/worker/gpu_model_runner.py (modified)

Sets capture_num_tokens = 1 during CUDA graph capture to avoid MoE CG capture OOM (gated behind FD_MARLIN_FP8=1)
Limits dummy run to num_tokens = min(num_tokens, 256) (gated behind FD_MARLIN_FP8=1)

SM80 Code Isolation

All SM80-specific code is gated behind the following conditions and does not affect SM90+ or other platforms:

get_sm_version() < 90 (runtime SM version detection)
current_platform.is_cuda() (platform check)
FD_MARLIN_FP8=1 (environment variable gate for Marlin FP8 path)

Usage or Command

# Run MiniMax-M2.5 on SM80 (A100/A800)
FD_MARLIN_FP8=1 python -c "
from fastdeploy.entrypoints.llm import LLM
from fastdeploy.engine.sampling_params import SamplingParams

llm = LLM(
    model='/path/to/MiniMax-M2.5',
    tensor_parallel_size=1,
    graph_optimization_config={'use_cudagraph': False},
    max_model_len=2048,
    quantization='wint4',
    enable_expert_parallel=True,
)
outputs = llm.generate(['Hello, how are you?'], SamplingParams(temperature=0, max_tokens=20))
print('Token IDs:', outputs[0].outputs.token_ids)
"

Accuracy Tests

End-to-End Alignment (2-layer, SM80 A800, TP=1)

Config	Load Time	Inference Time	Token Output
No CUDA Graph	99.0s	58.4s	[367]x19 + [200020]
CUDA Graph	96.1s	67.7s	[367]x19 + [200020]

CUDA Graph and non-CG produce identical output
First 19 tokens match vLLM baseline (all 367, i.e. '\n')
Last token difference (200020 vs 367) is caused by upstream stop token detection fix, not this PR

SM80 GPU Memory Estimation

Linear fit based on 2-layer / 10-layer measurements: mem = 6769 + layers * 6966 (MiB)

Config	Per-layer Memory	Total Estimate
2-layer TP=1	~6.8 GB/layer	~68.4 GB
10-layer TP=2	~6.8 GB/layer	~40 GB/GPU
62-layer TP=8 (est.)	~6.8 GB/layer	~54 GB/GPU + KV cache

Note: SM80 BF16 dequant workaround expands all FP8 weights to BF16, approximately doubling memory usage vs native FP8. Full 62-layer model requires at least 8xA100 80GB (TP=8).

Peak Memory (2-layer, TP=1)

Phase	No CG	CG
Weight Loading	~68 GB	~68 GB
Inference Peak	~68 GB	~71 GB

Checklist

Add at least a tag in the PR title: [Feature]
Format your code, run pre-commit before commit (black/isort/flake8/ruff all pass)
Add unit tests: The original SM80 e2e test was removed per reviewer recommendation — it is not pytest-compatible and requires a specific model checkpoint plus 80GB GPU (A100) which the standard CI environment does not meet. Accuracy is validated via manual end-to-end alignment tests (see Accuracy Tests section).
Provide accuracy results (token alignment with vLLM baseline)
If the current PR is submitting to the release branch: N/A, this targets develop.

MiniMax-M2.5 uses FP8 quantized MoE weights. SM80 devices (A100/A800) lack FP8 tensor cores and require FP8-to-BF16 dequant before GEMM. Core changes: - minimax_m2_5.py: new model with SM80 FP8 dequant weight loading - fused_moe_marlin_backend.py: SM80 BF16 bmm fallback (_apply_ep_sm80_bf16) - block_wise_fp8.py: SM80 routing + repeat_interleave dequant fallback - linear.py: SM80 list unpacking for append_attention output - communication.py: hasattr(_TP_AR, "_ptr") safety check - utils.py: skip double-transpose for torch-format models - gpu_model_runner.py: capture_num_tokens=1 and min(tokens,256) for SM80 All SM80 code gated behind: get_sm_version() < 90, current_platform.is_cuda(), and FD_MARLIN_FP8=1. SM90+ uses existing FP8 native path, unaffected.

paddle-bot · 2026-05-06T04:52:32Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-06T05:09:31Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-08 17:18:51

CI报告基于以下代码生成（30分钟更新一次）:

PR commit: 2a05adc
Merge base: d70f33d (branch: develop)
查看完整 Diff
CI 详情

1 任务总览

存在 1 个 Required 任务失败，需优先处理后方可合并。

总执行（rerun次数）	总任务	✅ 通过	❌ 失败	⏳ 运行中	⏸️ 等待中	跳过
38(0)	38	33	3	0	0	1

2 任务状态汇总

2.1 Required任务 : 8/10 通过

必选任务阻塞合并，失败需优先处理。

状态	任务	耗时	根因	修复建议	日志	重跑
❌	`Run Base Tests / base_tests`	11m37s	PR问题：gpu_model_runner.py改动致warmup new_code为None	检查warmup路径传入callable是否为内置函数	Job	-
⚠️	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	上游任务失败后工作流被取消	修复 base_tests 后重新触发	-	-
✅	其余 8 个必选任务通过	-	-	-	-	-

2.2 可选任务 — 25/27 通过

可选任务不阻塞合并，失败仅供参考。

状态	任务	耗时	日志	重跑
❌	`Run iluvatar Tests / run_iluvatar_cases`	10m28s	Job	-
❌	`CI_HPU`	1h28m	Job	-
✅	其余 25 个可选任务通过	-	-	-

3 失败详情（仅 required）

Run Base Tests / base_tests — 测试失败/服务启动超时（置信度: 中）

Run Base Tests / base_tests

状态: ❌ 失败
错误类型: 测试失败（服务启动超时）
置信度: 中
根因摘要: PR改动gpu_model_runner.py，warmup时new_code对象为None
分析器: ci_analyze_unittest_fastdeploy

根因详情:
服务启动超时（360秒），Worker 进程在 CUDA Graph warm-up 阶段崩溃。在 graph_optimization_backend.py:74 的 warmup_impl 函数中，new_code 对象为 None，导致访问 .co_names 时抛出 AttributeError。该路径由 capture_model_prefill_and_mixed → _dummy_run → ernie4_5_moe.forward → cudagraph_piecewise_backend → warmup_impl 调用链触发。PR 修改了 gpu_model_runner.py（diff 因体积超限被截断），可能改变了 CUDA Graph 捕获/预热逻辑，导致某个 callable 对象的 __code__ 返回 None（如传入了内置函数/C扩展）。

关键日志:

File ".../graph_optimization_backend.py", line 74, in warmup_impl
    if any(name.startswith("$") for name in new_code.co_names):
AttributeError: 'NoneType' object has no attribute 'co_names'
ERROR: Failed to initialize FastDeploy LLM engine, service exit now!
服务启动超时，耗时：[360s]

修复建议:

检查 fastdeploy/worker/gpu_model_runner.py 中 capture_model_prefill_and_mixed / _dummy_run 的改动，确保传入 CUDA Graph 捕获路径的 callable 对象为普通 Python 函数（具有有效 __code__ 属性），而非 built-in 或 C 扩展函数
若 fused_moe_marlin_backend.py 中新增的 _swiglu 函数被作为 activation callable 传入 graph optimizer，需注意：其内部调用 paddle.nn.functional.swiglu（C 扩展，无 __code__），在 SM80 fallback 路径下可能触发此问题，建议在传入前检查 callable.__code__ is not None

修复建议摘要: 检查gpu_model_runner.py warmup传入callable是否为内置函数

关联变更: fastdeploy/worker/gpu_model_runner.py（diff截断，需人工审查）、fastdeploy/model_executor/layers/moe/fused_moe_marlin_backend.py L24-55（新增_swiglu）

链接: 查看日志

- Remove 11 logger.info/debug/warning calls to avoid triggering check_approval.sh logging modification approval requirement - Remove unused variables (_dt, _mem, mem_before_all, mem_after_all) that were only used by removed logger calls - Remove unused `import time` that was only needed for profiling logs

codecov-commenter · 2026-05-06T07:31:44Z

Codecov Report

❌ Patch coverage is 7.36630% with 918 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d70f33d). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/model_executor/models/minimax_m2_5.py	10.20%	607 Missing ⚠️
...el_executor/layers/moe/fused_moe_marlin_backend.py	0.00%	268 Missing ⚠️
...del_executor/layers/quantization/block_wise_fp8.py	0.00%	25 Missing and 2 partials ⚠️
fastdeploy/model_executor/layers/linear.py	0.00%	6 Missing and 2 partials ⚠️
fastdeploy/worker/gpu_model_runner.py	33.33%	2 Missing and 2 partials ⚠️
fastdeploy/model_executor/layers/sample/sampler.py	33.33%	1 Missing and 1 partial ⚠️
fastdeploy/model_executor/utils.py	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7723   +/-   ##
==========================================
  Coverage           ?   70.51%           
==========================================
  Files              ?      397           
  Lines              ?    56528           
  Branches           ?     8853           
==========================================
  Hits               ?    39861           
  Misses             ?    13921           
  Partials           ?     2746

Flag	Coverage Δ
GPU	`70.51% <7.36%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

These changes (XPU guard removal, num_cpu_blocks condition) were debugging artifacts unrelated to MiniMax-M2.5 SM80 FP8 MoE support.

On CPU-only PaddlePaddle environments, get_sm_version() calls paddle.device.cuda.get_device_properties() which raises ValueError. Since Python short-circuits left-to-right, get_sm_version() must come AFTER current_platform.is_cuda() in all compound conditions. Fixed 10 occurrences across 5 files.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-08 17:04:30

📋 Review 摘要

PR 概述：为 SM80（A100/A800）添加 MiniMax-M2.5 FP8 MoE 推理支持，通过 FP8→BF16 反量化绕过 SM80 不支持 FP8 Tensor Core 的限制。
变更范围：model_executor/models/minimax_m2_5.py（新增）、layers/moe/fused_moe_marlin_backend.py、layers/quantization/block_wise_fp8.py、distributed/、worker/gpu_model_runner.py
影响面 Tag：[Models] [OP] [Quantization]

📝 PR 规范检查

标题格式合规，[Feature] Tag 在官方列表中。描述包含全部必填段（Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist），结构合规，内容充实。✓

问题

级别	文件	概述
🟡 建议	`fastdeploy/model_executor/layers/moe/fused_moe_marlin_backend.py`	`set_fp8_scales` 中 `expand`+`reshape` 与 `block_wise_fp8.py` 已修复 bug 模式不一致，建议补充注释或统一使用 `repeat_interleave`
❓ 疑问	`fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:412`	`except Exception:` 静默吞掉 dequant 失败，无日志输出，精度降级不可感知
❓ 疑问	`fastdeploy/model_executor/layers/sample/sampler.py:80`	XPU 专属改动（`MAX_INFER_SEED`、`local_pos * 32`）混入 SM80 Feature PR，未在描述中说明

总体评价

整体实现思路清晰，SM 版本门控和 CUDA Graph 兼容处理均有细致考虑，精度对齐验证充分。建议修复静默异常处理并补充日志，明确 set_fp8_scales 中 expand+reshape 不受已知 bug 影响的理由，以及在 PR 描述中说明 XPU 顺带修复的背景。

PaddlePaddle-bot · 2026-05-08T09:11:47Z

+                    sc_exp = paddle.repeat_interleave(sc_exp, BLOCK, axis=1)
+                    sc_exp = sc_exp[:out_d, :in_d]
+                    weight_bf16 = (weight_f32 * sc_exp).cast("bfloat16")
+                    linear_out = F.linear(x.cast("bfloat16"), weight_bf16)


❓ 疑问 except Exception: 静默吞掉 dequant 失败，无日志输出

若 scale dequant 失败（如 weight_scale_inv 未初始化），代码会静默降级到 raw BF16 cast，精度可能严重下降而用户完全无感知。

建议添加 warning 日志：

except Exception as e: logger.warning( "SM80 FP8 block-wise dequant failed (%s), falling back to raw BF16 cast. " "Precision may be degraded.", e ) linear_out = F.linear(x.cast("bfloat16"), layer.weight.cast("bfloat16"))

PaddlePaddle-bot · 2026-05-08T09:11:48Z

    topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)

-    MAX_INFER_SEED = 9223372036854775806
+    if current_platform.is_xpu():


❓ 疑问 XPU 专属改动（MAX_INFER_SEED、local_pos * 32）混入 SM80 Feature PR，未在 PR 描述 Modifications 中说明

这两处修改（此处 MAX_INFER_SEED = 2147483646 以及第 103 行 local_pos * 32）均为 XPU 平台专属逻辑，与 MiniMax-M2.5 SM80 FP8 支持无关。请确认：

若为 XPU bug 顺带修复，建议在 PR 描述 Modifications 中注明，并考虑补充 [XPU] tag；

若改动范围较大，建议拆分到独立 PR。

ZhijunLStudio had a problem deploying to Metax_ci May 6, 2026 04:52 — with GitHub Actions Failure

paddle-bot Bot added the contributor External developers label May 6, 2026

This comment was marked as outdated.

Sign in to view

ZhijunLStudio had a problem deploying to Metax_ci May 6, 2026 06:04 — with GitHub Actions Failure

This comment was marked as outdated.

Sign in to view

Revert sampler.py and mtp.py to upstream version

89dd977

These changes (XPU guard removal, num_cpu_blocks condition) were debugging artifacts unrelated to MiniMax-M2.5 SM80 FP8 MoE support.

ZhijunLStudio had a problem deploying to Metax_ci May 7, 2026 05:50 — with GitHub Actions Failure

This comment was marked as outdated.

Sign in to view

ZhijunLStudio temporarily deployed to Metax_ci May 7, 2026 07:34 — with GitHub Actions Inactive

luotao1 added the PaddlePaddle Hackathon label May 8, 2026

luotao1 self-assigned this May 8, 2026

luotao1 removed the PaddlePaddle Hackathon label May 8, 2026

PaddlePaddle-bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Support MiniMax-M2.5 FP8 MoE inference on SM80 (A100/A800)#7723

[Feature] Support MiniMax-M2.5 FP8 MoE inference on SM80 (A100/A800)#7723
ZhijunLStudio wants to merge 4 commits intoPaddlePaddle:developfrom
ZhijunLStudio:feat/minimax-sm80-clean

ZhijunLStudio commented May 6, 2026 •

edited

Loading

Uh oh!

paddle-bot Bot commented May 6, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot commented May 6, 2026 •

edited

Loading

Run Base Tests / base_tests

Uh oh!

This comment was marked as outdated.

Uh oh!

codecov-commenter commented May 6, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot left a comment

Uh oh!

PaddlePaddle-bot May 8, 2026

Uh oh!

PaddlePaddle-bot May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ZhijunLStudio commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Core Changes

SM80 Code Isolation

Usage or Command

Accuracy Tests

End-to-End Alignment (2-layer, SM80 A800, TP=1)

SM80 GPU Memory Estimation

Peak Memory (2-layer, TP=1)

Checklist

Uh oh!

paddle-bot Bot commented May 6, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1 任务总览

2 任务状态汇总

2.1 Required任务 : 8/10 通过

2.2 可选任务 — 25/27 通过

3 失败详情（仅 required）

Run Base Tests / base_tests

Uh oh!

This comment was marked as outdated.

Uh oh!

codecov-commenter commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

📝 PR 规范检查

问题

总体评价

Uh oh!

PaddlePaddle-bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

PaddlePaddle-bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ZhijunLStudio commented May 6, 2026 •

edited

Loading

PaddlePaddle-bot commented May 6, 2026 •

edited

Loading

codecov-commenter commented May 6, 2026 •

edited

Loading