[Feature] Support mtp overlap schedule#7001
Sunny-bot1 wants to merge 9 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7001 +/- ##
==========================================
Coverage ? 71.84%
==========================================
Files ? 399
Lines ? 56105
Branches ? 8854
==========================================
Hits ? 40306
Misses ? 12945
Partials ? 2854
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
fastdeploy-bot left a comment (AI CI Agent, skill: pr_review_agent)
This PR implements MTP (Multi-Token Prediction) overlap scheduling support with significant refactoring of CUDA kernels to use cooperative groups for better parallelization. The changes include removing CPU-GPU copies for better performance, adding defensive checks for negative batch indices, and modifying API signatures.
I found 1 P1 logic bug that needs to be addressed before merging.
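The summary above mentions defensive checks for negative batch indices. A minimal numpy sketch of that pattern (the function name, the gather semantics, and the `-1` convention for unscheduled slots are assumptions for illustration, not the PR's actual kernel):

```python
import numpy as np

def gather_hidden_states(hidden_states, batch_ids):
    """Reference sketch: scatter per-token hidden states to batch slots,
    skipping entries whose batch index is negative (hypothetical convention
    for tokens that were not scheduled in this step)."""
    out = np.zeros_like(hidden_states)
    for t, b in enumerate(batch_ids):
        if b < 0:  # defensive check: ignore unscheduled slots
            continue
        out[b] = hidden_states[t]
    return out
```

Without the `b < 0` guard, a negative index would silently wrap around in both numpy and raw CUDA pointer arithmetic, corrupting another batch slot.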
@@ -1167,6 +1196,17 @@ def _get_self_hidden_states(self, hidden_states):
P1 - API Mismatch Bug: The eagle_get_self_hidden_states kernel API was changed to expect seq_lens_encoder as the 4th parameter (see eagle_get_self_hidden_states.cu line 225), but this XPU branch is still passing step_idx. The kernel now uses seq_lens_encoder[t] > 0 to detect encoder phase instead of the old step_idx[i] == 1 check. This will produce incorrect results on XPU platform. Should be self.model_inputs.last_seq_lens_encoder (similar to the CUDA branch at line 1205) instead of self.model_inputs["step_idx"].
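The semantic change the comment describes can be sketched with numpy (the helper names and sample values are illustrative, not from the PR): the kernel used to flag the encoder phase with `step_idx == 1` and now flags it with `seq_lens_encoder > 0`, so passing `step_idx` where `seq_lens_encoder` is expected makes the condition meaningless.

```python
import numpy as np

def is_encoder_phase_old(step_idx):
    # old check: first decode step after prefill
    return step_idx == 1

def is_encoder_phase_new(seq_lens_encoder):
    # new check: nonzero encoder (prefill) length this step
    return seq_lens_encoder > 0

# Hypothetical batch of 4: entries 0 and 2 are in the encoder phase.
step_idx = np.array([1, 2, 1, 5])
seq_lens_encoder = np.array([128, 0, 64, 0])
print(is_encoder_phase_old(step_idx))          # [ True False  True False]
print(is_encoder_phase_new(seq_lens_encoder))  # [ True False  True False]
```

Feeding `step_idx` into `is_encoder_phase_new` would instead test `step_idx > 0`, which is true for every active request, which is why the XPU branch produces incorrect results.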
@@ -91,19 +91,20 @@ def test_eagle_get_self_hidden_states(self):
P1 - Test Inconsistency: The test passes step_idx_tensor but the kernel now expects seq_lens_encoder. While PaddlePaddle binds by position so this may not crash, the reference implementation computeOrderKernel (line 23-45) still uses step_idx == 1 logic while the actual CUDA kernel now uses seq_lens_encoder > 0 logic. The test should be updated to: 1) pass a proper seq_lens_encoder tensor, and 2) update the reference implementation to match the new kernel semantics.
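One way to update the test fixture, sketched with numpy in place of Paddle tensors (the shapes and the prompt length `16` are arbitrary assumptions): derive a `seq_lens_encoder` tensor that encodes the same phase information the old `step_idx_tensor` did, so the reference implementation and the kernel agree again.

```python
import numpy as np

# Hypothetical fixture: batch entries 0 and 2 are in the encoder phase.
step_idx = np.array([1, 3, 1, 7])  # old fixture: step_idx == 1 marks encoder phase

# Equivalent fixture under the new semantics: nonzero encoder length marks
# the encoder phase (16 is an arbitrary stand-in prompt length).
seq_lens_encoder = np.where(step_idx == 1, 16, 0)

# The two encodings must agree, so the reference implementation can be
# rewritten on top of `seq_lens_encoder > 0` without changing coverage.
assert ((seq_lens_encoder > 0) == (step_idx == 1)).all()
```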
Motivation
Enable overlap scheduling for MTP scenarios (with logprob disabled).
Modifications
Core optimizations
Notes
GLM TP4 results
Usage or Command
Accuracy Tests
Checklist
Choose at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
Run pre-commit before commit.
If the PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.