Moe Reduce kernel #4228
Conversation
May resolve the conflict.
Pull request overview
This PR introduces a new MOE reduce kernel for better performance and adds support for scale format configuration in FP8 KV cache operations. The main changes move the weight multiplication in MOE operations out of the fused kernels into a dedicated reduction kernel, and add scale rounding support for FP8 quantization.
- Implements a new moe_reduce kernel to handle the weighted reduction of expert outputs separately from the main MOE computation (a reference sketch follows this list)
- Adds a scale_fmt parameter to FP8 quantization functions to support different scale formats (specifically 'ue8m0' for rounded scales)
- Removes the inline weight multiplication from the MOE kernels, delegating it to the new reduction kernel
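The following is a minimal PyTorch sketch of the reduction the new kernel takes over, not the Triton code from the PR. It assumes per-expert outputs of shape (num_tokens, topk, hidden_dim) and routing weights of shape (num_tokens, topk); the helper name moe_reduce_reference is hypothetical.

```python
import torch


def moe_reduce_reference(hidden_states: torch.Tensor,
                         topk_weights: torch.Tensor,
                         fp32_acc: bool = False) -> torch.Tensor:
    """Reference semantics only (hypothetical helper, not the PR's Triton kernel).

    hidden_states: (num_tokens, topk, hidden_dim) per-expert outputs.
    topk_weights:  (num_tokens, topk) routing weights.
    """
    acc_dtype = torch.float32 if fp32_acc else hidden_states.dtype
    weighted = hidden_states.to(acc_dtype) * topk_weights.to(acc_dtype).unsqueeze(-1)
    # weighted sum over the top-k experts, cast back to the input dtype
    return weighted.sum(dim=1).to(hidden_states.dtype)
```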
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| lmdeploy/pytorch/kernels/cuda/fused_moe.py | Adds new moe_reduce kernel and helper function; removes weights/enable_weights parameters from fused_moe_kernel and fused_moe_kernel_launcher; replaces .sum() with moe_reduce() call |
| lmdeploy/pytorch/kernels/cuda/w8a8_fused_moe.py | Removes weights/enable_weights parameters from kernel and launcher; imports and uses moe_reduce instead of .sum() |
| lmdeploy/pytorch/kernels/cuda/blocked_fp8_fused_moe.py | Similar changes to w8a8_fused_moe.py: removes the weights parameters and uses moe_reduce |
| lmdeploy/pytorch/kernels/cuda/fill_kv_cache.py | Adds scale rounding helper functions (fast_log2_ceil, fast_pow2, fast_round_scale); adds ROUND_SCALE parameter to quantization; adds scale_fmt parameter support |
| lmdeploy/pytorch/backends/cuda/nsa.py | Adds scale_fmt configuration and passes it to quant_fp8 and fill_kv_cache_blocked_fp8 calls |
| lmdeploy/pytorch/backends/cuda/blockedf8_modules.py | Removes incorrect scale_fmt parameter from blocked_gemm_fp8 call (function doesn't accept this parameter) |
| lmdeploy/pytorch/backends/cuda/attention.py | Adds scale_fmt='ue8m0' parameter to fill_kv_cache_blocked_fp8 call |
| tests/pytorch/kernel/test_fused_moe.py | Removes weights and enable_weights fixtures; updates test ground truth to not apply weights (as kernel no longer does this) |
| tests/pytorch/kernel/test_fuse_moe_blocked_fp8.py | Similar test updates to test_fused_moe.py |
| tests/pytorch/kernel/test_fill_kv_cache.py | Adds scale_fmt fixture and parametrizes test with [None, 'ue8m0']; threads scale_fmt through gt fixture and test calls |
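For the 'ue8m0' scale format mentioned in the table, here is a hedged host-side illustration of what the new rounding helpers (fast_log2_ceil, fast_pow2, fast_round_scale) presumably approximate: rounding each quantization scale up to the next power of two so it can be stored as a bare 8-bit exponent. The helper name below is hypothetical and the exact kernel arithmetic may differ.

```python
import math


def round_scale_ue8m0(scale: float) -> float:
    """Hypothetical illustration of 'ue8m0' scale rounding (assumption, not the kernel code)."""
    exponent = math.ceil(math.log2(scale))  # analogue of fast_log2_ceil
    return 2.0 ** exponent                  # analogue of fast_pow2


# Rounding up keeps the quantized values from overflowing the FP8 range,
# e.g. a raw scale of 0.013 becomes 0.015625 (2**-6).
assert round_scale_ue8m0(0.013) == 2.0 ** -6
```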
def moe_reduce(hidden_states: torch.Tensor, topk_weights: torch.Tensor, fp32_acc: bool = False) -> torch.Tensor:
    """Moe reduce."""
Should we set fp32_acc=True to align with the previous behavior?
There is no single 'before'; different MOE kernels use different accumulators.
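As a generic aside on why the accumulator dtype can matter (not taken from the PR): float16 has large spacing at large magnitudes, so small contributions can be lost unless the running sum is kept in float32.

```python
import torch

big = torch.tensor(2048.0, dtype=torch.float16)
small = torch.tensor(0.5, dtype=torch.float16)

print(big + small)                  # 2048.0 -- the 0.5 is lost (fp16 spacing at 2048 is 2.0)
print(big.float() + small.float())  # 2048.5 -- preserved with a float32 accumulator
```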
RunningLeon left a comment:
LGTM
fill kv add scale fmt