Add moe_align_block_size_no_permute for small batch size with large num_expert
#30280
base: main
Conversation
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Code Review
This pull request introduces a new CUDA kernel, moe_align_block_size_no_permute, to optimize fused MoE for scenarios with small batch sizes and a large number of experts. The benchmark results demonstrate clear performance improvements for the targeted use cases. The implementation is sound, but I've identified a critical race condition in the new kernel that could lead to incorrect behavior. I've also suggested a minor improvement to enhance code readability by replacing a magic number with a named constant. Overall, this is a valuable addition once the identified issue is addressed.
for (size_t it = tid; it < max_num_tokens_padded; it += stride) {
  sorted_token_ids[it] = numel;
}
A __syncthreads() barrier is required here to prevent a race condition. The first loop initializes sorted_token_ids, and the second loop writes the final values. Without a barrier, threads from different warps can be in different loops at the same time, so the initialization writes can overwrite values that have already been written by the second loop. The barrier ensures all threads complete the initialization before any thread proceeds to the second loop.
__syncthreads();
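For context, here is a minimal sketch of the two-pass structure the comment describes, with the suggested barrier in place. The kernel and parameter names are illustrative, not the actual vLLM kernel; only the pass-1 loop mirrors the diff above.

#include <cstdint>

// Illustrative two-pass skeleton: pass 1 fills the padded output with a
// sentinel, pass 2 scatters the real token ids. The block-wide barrier
// guarantees no thread starts pass 2 while another is still initializing.
__global__ void align_no_permute_sketch(int32_t* sorted_token_ids, size_t numel,
                                        size_t max_num_tokens_padded) {
  const size_t tid = threadIdx.x;
  const size_t stride = blockDim.x;

  // Pass 1: initialize every padded slot with the sentinel value `numel`.
  for (size_t it = tid; it < max_num_tokens_padded; it += stride) {
    sorted_token_ids[it] = numel;
  }

  // Without this barrier, a warp still running pass 1 can overwrite slots
  // that another warp has already filled in pass 2.
  __syncthreads();

  // Pass 2: write the real token ids into their block-aligned slots
  // (slot computation omitted; it depends on the per-expert offsets).
}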
csrc/moe/moe_align_sum_kernels.cu
Outdated
VLLM_DISPATCH_INTEGRAL_AND_UNSIGNED_TYPES(
    topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] {
      // calc needed amount of shared mem for `cumsum` tensors
      bool no_permute_mode = (topk_ids.numel() * 4 <= num_experts) && !has_expert_map;
The number 4 in the condition for no_permute_mode is a magic number. To improve code readability and maintainability, it should be defined as a named constant with a comment explaining its purpose. This makes the code easier to understand and modify in the future.
constexpr int NO_PERMUTE_MODE_EXPERT_FACTOR = 4;
bool no_permute_mode = (topk_ids.numel() * NO_PERMUTE_MODE_EXPERT_FACTOR <= num_experts) && !has_expert_map;
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
This PR enhances the fused MoE implementation by adding a moe_align_block_size_no_permute kernel. It introduces a small-batch fallback path designed for models with large numbers of experts, where the regular kernel becomes inefficient. In these regimes, the new path provides a speed-up by reducing unnecessary data movement and improving kernel efficiency for low-token workloads.
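To make the fallback concrete, here is a rough host-side sketch of the selection logic. The wrapper and launcher names are hypothetical; only the no_permute_mode condition mirrors the heuristic visible in this PR's diff.

#include <cstdint>

// Hypothetical dispatch wrapper: take the no-permute path only when the
// routed-token count is small relative to the expert count and no expert
// map is in use; otherwise keep the regular kernel.
void moe_align_block_size_dispatch(int64_t topk_ids_numel, int64_t num_experts,
                                   bool has_expert_map) {
  // topk_ids_numel == num_tokens * topk, i.e. the number of routed tokens.
  const bool no_permute_mode =
      (topk_ids_numel * 4 <= num_experts) && !has_expert_map;

  if (no_permute_mode) {
    // Small-batch path: skips the permutation/sorting work that dominates
    // the regular kernel when there are very few tokens.
    // launch_moe_align_block_size_no_permute(...);  // hypothetical launcher
  } else {
    // Regular path for larger batches.
    // launch_moe_align_block_size(...);             // hypothetical launcher
  }
}

Test Result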
Benchmark: Llama-4-Maverick-17B-128E (TP = 8)
Command:
Results:
Benchmark: DeepSeek-R1 (TP = 8)
Command:
Results:
Benchmark: Qwen3-Coder-480B-A35B-Instruct-FP8 (TP = 8)
Command:
Results:
Benchmark: GPT-OSS-120B (TP = 4)
Command:
Results:
Accuracy:
Pass pytest:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.