
Conversation


@RunkaiTao RunkaiTao commented Dec 8, 2025

Purpose

This PR enhances the fused MoE implementation by adding a `moe_align_block_size_no_permute` kernel. It introduces a small-batch fallback path for models with a large number of experts, where the regular kernel becomes inefficient. In that regime, the new path skips the token permutation, reducing unnecessary data movement and improving kernel efficiency for low-token workloads.
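
For intuition, the sketch below shows when the fallback path would be selected. It is a hypothetical, self-contained illustration: the helper name use_no_permute, its signature, and the example values (128 experts, top-1 routing) are assumptions, while the threshold itself mirrors the dispatch condition quoted in the review comments further down (topk_ids.numel() * 4 <= num_experts, and no expert map).

// Hypothetical sketch (not the PR's code): when does the no-permute path apply?
// The threshold mirrors the dispatch condition quoted in the review below.
#include <cstdint>
#include <cstdio>

bool use_no_permute(int64_t num_tokens, int64_t top_k, int64_t num_experts,
                    bool has_expert_map) {
  const int64_t num_pairs = num_tokens * top_k;  // entries in topk_ids
  // Fall back only when token-expert pairs are few relative to the expert
  // count: most expert blocks are padding anyway, so the sort/permute pass of
  // the regular kernel is largely wasted work.
  return (num_pairs * 4 <= num_experts) && !has_expert_map;
}

int main() {
  // Illustrative values only (128 experts, top-1 routing, no expert map);
  // under these assumptions the fallback is taken for M <= 32.
  const int64_t ms[] = {1, 8, 32, 48};
  for (int64_t m : ms) {
    std::printf("M=%2lld -> no_permute=%d\n", (long long)m,
                (int)use_no_permute(m, /*top_k=*/1, /*num_experts=*/128,
                                    /*has_expert_map=*/false));
  }
  return 0;
}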

Test Result

Benchmark: Llama-4-Maverick-17B-128E (TP = 8)

Command:

python3 benchmarks/kernels/benchmark_moe.py \
  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tp-size 8

Results:

| M  | with permute (µs) | without permute (µs) | speed-up |
|----|-------------------|----------------------|----------|
| 1  | 33.14             | 29.56                | -12.11%  |
| 2  | 36.05             | 32.73                | -10.14%  |
| 4  | 49.42             | 46.22                | -6.92%   |
| 8  | 76.77             | 73.20                | -4.88%   |
| 16 | 136.29            | 132.32               | -3.00%   |
| 24 | 183.81            | 178.01               | -3.26%   |
| 32 | 234.63            | 228.54               | -2.66%   |
| 48 | 309.59            | 309.73               | 0.05%    |
| 64 | 382.59            | 382.68               | 0.02%    |
| 96 | 513.39            | 512.00               | -0.27%   |

Benchmark: DeepSeek-R1 (TP = 8)

Command:

python3 benchmarks/kernels/benchmark_moe.py \
  --model deepseek-ai/DeepSeek-R1 \
  --tp-size 8

Results:

| M  | with permute (µs) | without permute (µs) | speed-up |
|----|-------------------|----------------------|----------|
| 1  | 70.69             | 65.53                | -6.25%   |
| 2  | 83.58             | 80.34                | -4.03%   |
| 4  | 112.80            | 110.32               | -2.25%   |
| 8  | 171.68            | 172.13               | 0.20%    |
| 16 | 338.09            | 338.50               | 0.12%    |

Benchmark: Qwen3-Coder-480B-A35B-Instruct-FP8 (TP = 8)

Command:

python3 benchmarks/kernels/benchmark_moe.py \
  --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --tp-size 8

Results:

| M  | with permute (µs) | without permute (µs) | speed-up |
|----|-------------------|----------------------|----------|
| 1  | 69.76             | 66.06                | -5.60%   |
| 2  | 85.90             | 82.24                | -4.45%   |
| 4  | 116.18            | 114.19               | -1.74%   |
| 8  | 176.28            | 176.36               | 0.004%   |
| 16 | 340.01            | 343.42               | -0.01%   |

Benchmark: GPT-OSS-120B (TP = 4)

Command:

python3 benchmarks/kernels/benchmark_moe.py \
  --model openai/gpt-oss-120b \
  --tp-size 4

Results:

| M  | with permute (µs) | without permute (µs) | speed-up |
|----|-------------------|----------------------|----------|
| 1  | 49.55             | 46.41                | -6.77%   |
| 2  | 68.18             | 65.01                | -4.88%   |
| 4  | 120.50            | 118.20               | -1.95%   |
| 8  | 227.08            | 227.45               | 0.16%    |
| 16 | 322.53            | 322.60               | 0.02%    |

Accuracy:
The following test passes:

pytest tests/kernels/moe/test_moe.py::test_fused_moe


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new CUDA kernel, moe_align_block_size_no_permute, to optimize fused MoE for scenarios with small batch sizes and a large number of experts. The benchmark results demonstrate clear performance improvements for the targeted use cases. The implementation is sound, but I've identified a critical race condition in the new kernel that could lead to incorrect behavior. I've also suggested a minor improvement to enhance code readability by replacing a magic number with a named constant. Overall, this is a valuable addition once the identified issue is addressed.

for (size_t it = tid; it < max_num_tokens_padded; it += stride) {
  sorted_token_ids[it] = numel;
}


critical

A __syncthreads() barrier is required here to prevent a race condition. The first loop initializes sorted_token_ids, and the second loop writes the final values. Without a barrier, threads from different warps can be in different loops, leading to the initialization writes overwriting the final values. This ensures all threads complete initialization before proceeding.

  __syncthreads();
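
For context, here is a minimal standalone sketch of the pattern this comment describes: a sentinel-fill loop, the proposed barrier, then the scatter of real values. It is a simplified illustration under assumptions (a single thread block, and a hypothetical slot_of_pair array holding each pair's destination), not the PR's actual kernel.

// Simplified illustration of the race, not the PR's kernel. Phase 1 fills
// every padded slot with the sentinel `numel`; phase 2 scatters real token
// ids. Without __syncthreads(), a warp that has moved on to phase 2 can have
// its writes overwritten by another warp still running phase 1. A single
// thread block is assumed, since __syncthreads() only synchronizes one block.
__global__ void fill_then_scatter(int32_t* sorted_token_ids,
                                  const int32_t* slot_of_pair,  // hypothetical
                                  int32_t numel,
                                  int32_t max_num_tokens_padded) {
  const int32_t tid = threadIdx.x;
  const int32_t stride = blockDim.x;

  // Phase 1: initialize all padded slots to the out-of-range sentinel.
  for (int32_t it = tid; it < max_num_tokens_padded; it += stride) {
    sorted_token_ids[it] = numel;
  }

  __syncthreads();  // all sentinel writes must land before any real write

  // Phase 2: place each token-expert pair into its precomputed slot.
  for (int32_t it = tid; it < numel; it += stride) {
    sorted_token_ids[slot_of_pair[it]] = it;
  }
}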

VLLM_DISPATCH_INTEGRAL_AND_UNSIGNED_TYPES(
    topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] {
      // calc needed amount of shared mem for `cumsum` tensors
      bool no_permute_mode =
          (topk_ids.numel() * 4 <= num_experts) && !has_expert_map;

high

The number 4 in the condition for no_permute_mode is a magic number. To improve code readability and maintainability, it should be defined as a named constant with a comment explaining its purpose. This makes the code easier to understand and modify in the future.

        constexpr int NO_PERMUTE_MODE_EXPERT_FACTOR = 4;
        bool no_permute_mode = (topk_ids.numel() * NO_PERMUTE_MODE_EXPERT_FACTOR <= num_experts) && !has_expert_map;


mergify bot commented Dec 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @RunkaiTao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 9, 2025