Moe Reduce kernel #4228
Conversation
May resolve the conflict.
Pull request overview
This PR introduces a new MOE reduce kernel for better performance and adds support for scale format configuration in FP8 KV cache operations. The main changes move the weight multiplication in MOE operations out of the fused kernels into a dedicated reduction kernel, and add scale rounding support for FP8 quantization.
- Implements a new moe_reduce kernel to handle the weighted reduction of expert outputs separately from the main MOE computation (a reference sketch follows this list)
- Adds a scale_fmt parameter to FP8 quantization functions to support different scale formats (specifically 'ue8m0' for rounded scales)
- Removes the inline weight multiplication from the MOE kernels, delegating it to the new reduction kernel
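The following is a minimal PyTorch sketch of the reduction the new kernel takes over, not the Triton code from the PR. It assumes per-expert outputs of shape (num_tokens, topk, hidden_dim) and routing weights of shape (num_tokens, topk); the helper name moe_reduce_reference is hypothetical.

```python
import torch


def moe_reduce_reference(hidden_states: torch.Tensor,
                         topk_weights: torch.Tensor,
                         fp32_acc: bool = False) -> torch.Tensor:
    """Reference semantics only (hypothetical helper, not the PR's Triton kernel).

    hidden_states: (num_tokens, topk, hidden_dim) per-expert outputs.
    topk_weights:  (num_tokens, topk) routing weights.
    """
    acc_dtype = torch.float32 if fp32_acc else hidden_states.dtype
    weighted = hidden_states.to(acc_dtype) * topk_weights.to(acc_dtype).unsqueeze(-1)
    # weighted sum over the top-k experts, cast back to the input dtype
    return weighted.sum(dim=1).to(hidden_states.dtype)
```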
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| lmdeploy/pytorch/kernels/cuda/fused_moe.py | Adds new moe_reduce kernel and helper function; removes weights/enable_weights parameters from fused_moe_kernel and fused_moe_kernel_launcher; replaces .sum() with moe_reduce() call |
| lmdeploy/pytorch/kernels/cuda/w8a8_fused_moe.py | Removes weights/enable_weights parameters from kernel and launcher; imports and uses moe_reduce instead of .sum() |
| lmdeploy/pytorch/kernels/cuda/blocked_fp8_fused_moe.py | Similar changes to w8a8_fused_moe.py: removes the weights parameters and uses moe_reduce |
| lmdeploy/pytorch/kernels/cuda/fill_kv_cache.py | Adds scale rounding helper functions (fast_log2_ceil, fast_pow2, fast_round_scale); adds ROUND_SCALE parameter to quantization; adds scale_fmt parameter support |
| lmdeploy/pytorch/backends/cuda/nsa.py | Adds scale_fmt configuration and passes it to quant_fp8 and fill_kv_cache_blocked_fp8 calls |
| lmdeploy/pytorch/backends/cuda/blockedf8_modules.py | Removes incorrect scale_fmt parameter from blocked_gemm_fp8 call (function doesn't accept this parameter) |
| lmdeploy/pytorch/backends/cuda/attention.py | Adds scale_fmt='ue8m0' parameter to fill_kv_cache_blocked_fp8 call |
| tests/pytorch/kernel/test_fused_moe.py | Removes weights and enable_weights fixtures; updates test ground truth to not apply weights (as kernel no longer does this) |
| tests/pytorch/kernel/test_fuse_moe_blocked_fp8.py | Similar test updates to test_fused_moe.py |
| tests/pytorch/kernel/test_fill_kv_cache.py | Adds scale_fmt fixture and parametrizes test with [None, 'ue8m0']; threads scale_fmt through gt fixture and test calls |
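For the 'ue8m0' scale format mentioned in the table, here is a hedged host-side illustration of what the new rounding helpers (fast_log2_ceil, fast_pow2, fast_round_scale) presumably approximate: rounding each quantization scale up to the next power of two so it can be stored as a bare 8-bit exponent. The helper name below is hypothetical and the exact kernel arithmetic may differ.

```python
import math


def round_scale_ue8m0(scale: float) -> float:
    """Hypothetical illustration of 'ue8m0' scale rounding (assumption, not the kernel code)."""
    exponent = math.ceil(math.log2(scale))  # analogue of fast_log2_ceil
    return 2.0 ** exponent                  # analogue of fast_pow2


# Rounding up keeps the quantized values from overflowing the FP8 range,
# e.g. a raw scale of 0.013 becomes 0.015625 (2**-6).
assert round_scale_ue8m0(0.013) == 2.0 ** -6
```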
def moe_reduce(hidden_states: torch.Tensor, topk_weights: torch.Tensor, fp32_acc: bool = False) -> torch.Tensor:
    """Moe reduce."""
Should we set fp32_acc=True to align with the previous behavior?
There is no single 'before'; different MOE kernels use different accumulators.
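As a generic aside on why the accumulator dtype can matter (not taken from the PR): float16 has large spacing at large magnitudes, so small contributions can be lost unless the running sum is kept in float32.

```python
import torch

big = torch.tensor(2048.0, dtype=torch.float16)
small = torch.tensor(0.5, dtype=torch.float16)

print(big + small)                  # 2048.0 -- the 0.5 is lost (fp16 spacing at 2048 is 2.0)
print(big.float() + small.float())  # 2048.5 -- preserved with a float32 accumulator
```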
RunningLeon left a comment:
LGTM
fill kv add scale fmt