[Feature] Support MiniMax-M2.5 FP8 MoE inference on SM80 (A100/A800) #7723
ZhijunLStudio wants to merge 4 commits into PaddlePaddle:develop from
Conversation
MiniMax-M2.5 uses FP8 quantized MoE weights. SM80 devices (A100/A800) lack FP8 tensor cores and require FP8-to-BF16 dequant before GEMM. Core changes:
- `minimax_m2_5.py`: new model with SM80 FP8 dequant weight loading
- `fused_moe_marlin_backend.py`: SM80 BF16 bmm fallback (`_apply_ep_sm80_bf16`)
- `block_wise_fp8.py`: SM80 routing + `repeat_interleave` dequant fallback
- `linear.py`: SM80 list unpacking for `append_attention` output
- `communication.py`: `hasattr(_TP_AR, "_ptr")` safety check
- `utils.py`: skip double-transpose for torch-format models
- `gpu_model_runner.py`: `capture_num_tokens=1` and `min(tokens, 256)` for SM80

All SM80 code is gated behind `get_sm_version() < 90`, `current_platform.is_cuda()`, and `FD_MARLIN_FP8=1`. SM90+ uses the existing native FP8 path and is unaffected.
Thanks for your contribution!
CI report generated against the code below (refreshed every 30 minutes):
1. Task overview: 1 Required task failed and must be addressed before the PR can be merged.
2. Task status summary
   2.1 Required tasks: 8/10 passed
   2.2 Optional tasks: 25/27 passed
3. Failure details (Required only): Run Base Tests / base_tests — test failure / service startup timeout (confidence: medium)
Fix suggestion: check whether the callable passed to warmup in gpu_model_runner.py is a built-in function.
- Remove 11 logger.info/debug/warning calls to avoid triggering the check_approval.sh logging-modification approval requirement
- Remove unused variables (_dt, _mem, mem_before_all, mem_after_all) that were only used by the removed logger calls
- Remove unused `import time` that was only needed for profiling logs
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7723 +/- ##
==========================================
Coverage ? 70.51%
==========================================
Files ? 397
Lines ? 56528
Branches ? 8853
==========================================
Hits ? 39861
Misses ? 13921
Partials ? 2746
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
These changes (XPU guard removal, num_cpu_blocks condition) were debugging artifacts unrelated to MiniMax-M2.5 SM80 FP8 MoE support.
On CPU-only PaddlePaddle environments, `get_sm_version()` calls `paddle.device.cuda.get_device_properties()`, which raises a ValueError. Since Python evaluates `and` left-to-right with short-circuiting, `get_sm_version()` must come AFTER `current_platform.is_cuda()` in all compound conditions. Fixed 10 occurrences across 5 files.
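A minimal sketch of the required ordering (the import paths and the downstream call are assumptions for illustration; the guard pattern is the point):

```python
# Assumed import locations -- illustrative only.
from fastdeploy.platforms import current_platform
from fastdeploy.utils import get_sm_version

# Wrong: on a CPU-only build, get_sm_version() queries CUDA device
# properties and raises ValueError before is_cuda() is ever evaluated.
#   if get_sm_version() < 90 and current_platform.is_cuda(): ...

# Right: the platform check runs first and short-circuits, so
# get_sm_version() is only called when a CUDA device is present.
if current_platform.is_cuda() and get_sm_version() < 90:
    enable_sm80_bf16_fallback()  # hypothetical downstream action
```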
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-08 17:04:30
📋 Review Summary
PR overview: adds MiniMax-M2.5 FP8 MoE inference support on SM80 (A100/A800), working around the lack of FP8 Tensor Cores on SM80 via FP8→BF16 dequantization.
Change scope: model_executor/models/minimax_m2_5.py (new), layers/moe/fused_moe_marlin_backend.py, layers/quantization/block_wise_fp8.py, distributed/, worker/gpu_model_runner.py
Impact tags: [Models] [OP] [Quantization]
📝 PR Convention Check
Title format is compliant and the [Feature] tag is on the official list. The description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist), is well structured, and is substantive. ✓
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/fused_moe_marlin_backend.py | The `expand`+`reshape` in `set_fp8_scales` is inconsistent with the bug pattern already fixed in block_wise_fp8.py; add a comment or switch to `repeat_interleave` |
| ❓ Question | fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:412 | `except Exception:` silently swallows dequant failures with no logging, so precision degradation goes unnoticed |
| ❓ Question | fastdeploy/model_executor/layers/sample/sampler.py:80 | XPU-specific changes (`MAX_INFER_SEED`, `local_pos * 32`) are mixed into an SM80 feature PR without being mentioned in the description |
Overall Assessment
The implementation approach is clear: SM version gating and CUDA Graph compatibility are handled carefully, and precision alignment is well validated. Suggested follow-ups: fix the silent exception handling and add logging, document why the `expand`+`reshape` in `set_fp8_scales` is unaffected by the known bug, and explain the drive-by XPU fixes in the PR description.
```python
sc_exp = paddle.repeat_interleave(sc_exp, BLOCK, axis=1)
sc_exp = sc_exp[:out_d, :in_d]
weight_bf16 = (weight_f32 * sc_exp).cast("bfloat16")
linear_out = F.linear(x.cast("bfloat16"), weight_bf16)
```
❓ Question: `except Exception:` silently swallows dequant failures with no log output
If the scale dequant fails (e.g., `weight_scale_inv` is uninitialized), the code silently falls back to a raw BF16 cast; precision may degrade severely while the user remains completely unaware.
Suggested warning log:
```python
except Exception as e:
    logger.warning(
        "SM80 FP8 block-wise dequant failed (%s), falling back to raw BF16 cast. "
        "Precision may be degraded.", e
    )
    linear_out = F.linear(x.cast("bfloat16"), layer.weight.cast("bfloat16"))
```

Diff context from fastdeploy/model_executor/layers/sample/sampler.py:

```python
topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
```

```python
MAX_INFER_SEED = 9223372036854775806
if current_platform.is_xpu():
```
❓ Question: XPU-specific changes (`MAX_INFER_SEED`, `local_pos * 32`) are mixed into this SM80 feature PR and not mentioned in the PR description's Modifications section
Both changes (`MAX_INFER_SEED = 2147483646` here and `local_pos * 32` at line 103) are XPU-platform-specific logic unrelated to MiniMax-M2.5 SM80 FP8 support. Please confirm:
- If these are drive-by XPU bug fixes, note them in the PR description's Modifications section and consider adding an [XPU] tag;
- If the change is substantial, consider splitting it into a separate PR.
Motivation
MiniMax-M2.5 uses FP8 quantized MoE weights. On SM90+ (H100/B200), the FP8 GEMMs run directly on FP8 tensor cores. SM80 devices (A100/A800), however, lack FP8 hardware support, so FP8 weights must be dequantized to BF16 before standard GEMM computation.
This PR adds full FP8 MoE inference support for SM80 devices, enabling MiniMax-M2.5 to run on A100/A800 GPUs.
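A minimal sketch of the block-wise FP8→BF16 dequantization described above, assuming a 128×128 quantization block and a per-block inverse-scale tensor named `weight_scale_inv` (names and block size are illustrative; the actual code lives in `minimax_m2_5.py` and `block_wise_fp8.py`):

```python
import paddle

BLOCK = 128  # assumed block-wise quantization granularity

def dequant_fp8_to_bf16(weight_fp8: paddle.Tensor, weight_scale_inv: paddle.Tensor) -> paddle.Tensor:
    """Expand per-block scales to per-element scales and dequantize to BF16."""
    out_d, in_d = weight_fp8.shape
    weight_f32 = weight_fp8.cast("float32")
    # Expand the coarse [ceil(out_d/BLOCK), ceil(in_d/BLOCK)] scale grid to
    # elementwise scales. repeat_interleave is used instead of expand+reshape
    # to sidestep the PaddlePaddle memory-layout bug noted in this PR.
    sc_exp = weight_scale_inv.cast("float32")
    sc_exp = paddle.repeat_interleave(sc_exp, BLOCK, axis=0)
    sc_exp = paddle.repeat_interleave(sc_exp, BLOCK, axis=1)
    sc_exp = sc_exp[:out_d, :in_d]  # crop padding when dims are not block-aligned
    return (weight_f32 * sc_exp).cast("bfloat16")
```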
Modifications
Core Changes
- fastdeploy/model_executor/models/minimax_m2_5.py (new file)
  - `_load_fp8_marlin_layer`: dequantizes FP8 weights to BF16 on SM80, stored in `layer._sm80_gate/up/down`
  - `_dequant_fp8_weights`: batch FP8 dequant with numpy-based block-wise scale expansion
  - gated behind `get_sm_version() < 90`
- fastdeploy/model_executor/layers/moe/fused_moe_marlin_backend.py (modified)
  - `weight_type` detection (fp8 vs int4) to support the FP8 Marlin path
  - `create_weights`: skips Marlin packing on SM80 FP8 and creates dummy placeholder parameters
  - `apply` / `apply_ep_noalltoall`: route to `_apply_ep_sm80_bf16` on SM80
  - `_apply_ep_sm80_bf16`: one-hot + matmul for expert weight selection and `paddle.bmm` for the FFN; no `numpy()` calls, no data-dependent Python loops, no `gather_nd` (see the sketch after this list)
- fastdeploy/model_executor/layers/quantization/block_wise_fp8.py (modified)
  - `BlockWiseFP8Config.get_quant_method()`: routes to `MarlinWeightOnlyMoEMethod` on SM80 with `FD_MARLIN_FP8=1`
  - `BlockWiseFP8LinearMethod.apply()`: SM80 BF16 dequant fallback using `paddle.repeat_interleave` instead of `expand`+`reshape` to avoid a PaddlePaddle memory-layout bug
- fastdeploy/model_executor/layers/linear.py (modified): SM80 list unpacking for `append_attention` output
- fastdeploy/distributed/communication.py (modified)
  - `capture_custom_allreduce()`: added `hasattr(_TP_AR, "_ptr")` safety check
- fastdeploy/worker/gpu_model_runner.py (modified)
  - `capture_num_tokens = 1` during CUDA graph capture to avoid MoE CG capture OOM (gated behind `FD_MARLIN_FP8=1`)
  - `num_tokens = min(num_tokens, 256)` (gated behind `FD_MARLIN_FP8=1`)
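A sketch of the one-hot + matmul expert selection mentioned above for `_apply_ep_sm80_bf16` (shapes, names, and the helper itself are illustrative, not the actual FastDeploy code; the technique is to replace `gather_nd` with fixed-shape matmuls so the kernel sequence stays CUDA-graph capturable):

```python
import paddle
import paddle.nn.functional as F

def select_expert_weights(expert_ids: paddle.Tensor, expert_weights: paddle.Tensor) -> paddle.Tensor:
    """Gather each token's expert weight without gather_nd or Python loops.

    expert_ids:     [num_tokens] int64 routed expert index per token
    expert_weights: [num_experts, d_in, d_out] dequantized BF16 weights
    returns:        [num_tokens, d_in, d_out]
    """
    num_experts, d_in, d_out = expert_weights.shape
    # One-hot routing matrix [num_tokens, num_experts]; all shapes are
    # static, so this is safe to capture in a CUDA graph.
    one_hot = F.one_hot(expert_ids, num_experts).cast(expert_weights.dtype)
    flat = expert_weights.reshape([num_experts, d_in * d_out])
    picked = paddle.matmul(one_hot, flat)  # [num_tokens, d_in * d_out]
    return picked.reshape([-1, d_in, d_out])

# The FFN then runs per token via paddle.bmm over the selected weights, e.g.:
#   h = paddle.bmm(x.unsqueeze(1), w_selected).squeeze(1)
```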
SM80 Code Isolation
All SM80-specific code is gated behind the following conditions and does not affect SM90+ or other platforms:
- `get_sm_version() < 90` (runtime SM version detection)
- `current_platform.is_cuda()` (platform check)
- `FD_MARLIN_FP8=1` (environment-variable gate for the Marlin FP8 path)

Usage or Command
Accuracy Tests
End-to-End Alignment (2-layer, SM80 A800, TP=1)
SM80 GPU Memory Estimation
Linear fit based on 2-layer / 10-layer measurements:
`mem = 6769 + layers * 6966` (MiB)
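For example, plugging layers = 10 into the fit gives 6769 + 10 × 6966 = 76,429 MiB.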
Peak Memory (2-layer, TP=1)

Checklist
- PR title tag: `[Feature]`
- Ran `pre-commit` before commit (black/isort/flake8/ruff all pass)
- `release` branch: N/A, this targets `develop`.