[Feature] Support mtp overlap schedule#7001
Sunny-bot1 wants to merge 9 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7001 +/- ##
==========================================
Coverage ? 71.84%
==========================================
Files ? 399
Lines ? 56105
Branches ? 8854
==========================================
Hits ? 40306
Misses ? 12945
Partials ? 2854
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
fastdeploy-bot left a comment (AI CI Agent, skill: pr_review_agent)
This PR implements MTP (Multi-Token Prediction) overlap scheduling support with significant refactoring of CUDA kernels to use cooperative groups for better parallelization. The changes include removing CPU-GPU copies for better performance, adding defensive checks for negative batch indices, and modifying API signatures.
I found 1 P1 logic bug that needs to be addressed before merging.
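The summary above mentions defensive checks for negative batch indices. A minimal numpy sketch of that pattern (the function name, the gather semantics, and the `-1` convention for unscheduled slots are assumptions for illustration, not the PR's actual kernel):

```python
import numpy as np

def gather_hidden_states(hidden_states, batch_ids):
    """Reference sketch: scatter per-token hidden states to batch slots,
    skipping entries whose batch index is negative (hypothetical convention
    for tokens that were not scheduled in this step)."""
    out = np.zeros_like(hidden_states)
    for t, b in enumerate(batch_ids):
        if b < 0:  # defensive check: ignore unscheduled slots
            continue
        out[b] = hidden_states[t]
    return out
```

Without the `b < 0` guard, a negative index would silently wrap around in both numpy and raw CUDA pointer arithmetic, corrupting another batch slot.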
@@ -1167,6 +1196,17 @@ def _get_self_hidden_states(self, hidden_states):
P1 - API Mismatch Bug: The eagle_get_self_hidden_states kernel API was changed to expect seq_lens_encoder as the 4th parameter (see eagle_get_self_hidden_states.cu line 225), but this XPU branch is still passing step_idx. The kernel now uses seq_lens_encoder[t] > 0 to detect encoder phase instead of the old step_idx[i] == 1 check. This will produce incorrect results on XPU platform. Should be self.model_inputs.last_seq_lens_encoder (similar to the CUDA branch at line 1205) instead of self.model_inputs["step_idx"].
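The semantic change the comment describes can be sketched with numpy (the helper names and sample values are illustrative, not from the PR): the kernel used to flag the encoder phase with `step_idx == 1` and now flags it with `seq_lens_encoder > 0`, so passing `step_idx` where `seq_lens_encoder` is expected makes the condition meaningless.

```python
import numpy as np

def is_encoder_phase_old(step_idx):
    # old check: first decode step after prefill
    return step_idx == 1

def is_encoder_phase_new(seq_lens_encoder):
    # new check: nonzero encoder (prefill) length this step
    return seq_lens_encoder > 0

# Hypothetical batch of 4: entries 0 and 2 are in the encoder phase.
step_idx = np.array([1, 2, 1, 5])
seq_lens_encoder = np.array([128, 0, 64, 0])
print(is_encoder_phase_old(step_idx))          # [ True False  True False]
print(is_encoder_phase_new(seq_lens_encoder))  # [ True False  True False]
```

Feeding `step_idx` into `is_encoder_phase_new` would instead test `step_idx > 0`, which is true for every active request, which is why the XPU branch produces incorrect results.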
@@ -91,19 +91,20 @@ def test_eagle_get_self_hidden_states(self):
P1 - Test Inconsistency: The test passes step_idx_tensor but the kernel now expects seq_lens_encoder. While PaddlePaddle binds by position so this may not crash, the reference implementation computeOrderKernel (line 23-45) still uses step_idx == 1 logic while the actual CUDA kernel now uses seq_lens_encoder > 0 logic. The test should be updated to: 1) pass a proper seq_lens_encoder tensor, and 2) update the reference implementation to match the new kernel semantics.
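One way to update the test fixture, sketched with numpy in place of Paddle tensors (the shapes and the prompt length `16` are arbitrary assumptions): derive a `seq_lens_encoder` tensor that encodes the same phase information the old `step_idx_tensor` did, so the reference implementation and the kernel agree again.

```python
import numpy as np

# Hypothetical fixture: batch entries 0 and 2 are in the encoder phase.
step_idx = np.array([1, 3, 1, 7])  # old fixture: step_idx == 1 marks encoder phase

# Equivalent fixture under the new semantics: nonzero encoder length marks
# the encoder phase (16 is an arbitrary stand-in prompt length).
seq_lens_encoder = np.where(step_idx == 1, 16, 0)

# The two encodings must agree, so the reference implementation can be
# rewritten on top of `seq_lens_encoder > 0` without changing coverage.
assert ((seq_lens_encoder > 0) == (step_idx == 1)).all()
```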
Motivation
Enable overlap scheduling for MTP scenarios (with logprob disabled).
Modifications
Core optimizations
Notes
GLM TP4 results
Usage or Command
Accuracy Tests
Checklist
Choose at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
Run pre-commit before commit.
If the PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.