[Speculative Decoding] fix pd-split metrics and support other model runner #6995
freeliuzc wants to merge 2 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR aims to improve metrics recording for Speculative Decoding (in particular MTP) in the PD-split scenario, and to support a specific model runner (gated by an environment-variable switch) so that EP/empty-input cases no longer misbehave.
Changes:
- Add logging and a fallback empty-input forward pass for MTP in a specific branch of GPUModelRunner.
- Introduce an environment-variable switch in the MTP proposer that controls the handling path for multimodal attn_mask_offsets.
- In the PD-split resource-management flow, fix/isolate the metrics object, and add a decoder-side inference start-timestamp field plus an update method.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Adds logging for the speculative decoding path; when there is no valid input/output, runs a fallback empty-input forward pass for MTP + EP |
| fastdeploy/spec_decode/mtp.py | Adds the EB5_ENABLE_FD_RUNNER environment-variable switch and conditionally skips building/updating the multimodal mask offsets in the CUDA path and when inserting tasks |
| fastdeploy/engine/sched/resource_manager_v1.py | Deep-copies metrics when the PD decode side receives a prefilled request and sets the decoder inference start time |
| fastdeploy/engine/request.py | Adds a decoder-engine send-timestamp field to RequestMetrics, plus a new update method that sets it |
Comments suppressed due to low confidence (1)
fastdeploy/spec_decode/mtp.py:904
- Same issue as above: when eb5_runner=true, _propose_cuda() skips update_attn_mask_offsets, but the forward pass still computes attention based on ForwardMeta.attn_mask_offsets whenever enable_mm=True. The offsets may then keep the previous batch's values, or remain uninitialized, and pollute the current step's inference. In the eb5_runner case, consider explicitly setting ForwardMeta.attn_mask_offsets to None (disabling multimodal mask offsets), or guarantee that the offsets are correctly refreshed on every step.
if self.enable_mm and not self.eb5_runner:
attn_mask_offsets = update_attn_mask_offsets(
ids_remove_padding,
getattr(
self.model_inputs, "seq_lens_this_time", self.model_inputs["seq_lens_this_time_buffer"]
),
self.model_inputs["seq_lens_encoder"],
self.model_inputs["seq_lens_decoder"],
cu_seqlens_q,
self.model_inputs["attn_mask_offsets_full"],
self.model_inputs["attn_mask_offsets_decoder"],
self.model_inputs["is_block_step"],
self.model_inputs["decode_states"],
self.model_inputs["mask_rollback"],
)
self.model_inputs["attn_mask_offsets"].copy_(attn_mask_offsets, False)
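The reviewer's suggestion can be sketched as a small gate when building the forward metadata. This is a minimal illustration, not FastDeploy's actual code: the `ForwardMeta` dataclass here is a simplified stand-in (the real one holds GPU tensors), and `prepare_forward_meta` is a hypothetical helper.

```python
# Sketch of the suggested fix: when the eb5_runner switch is on, never hand
# the attention backend a mask-offsets buffer that was not refreshed this
# step. ForwardMeta and prepare_forward_meta are illustrative stand-ins.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForwardMeta:
    # In the real code this is a GPU tensor; a list stands in here.
    attn_mask_offsets: Optional[List[int]] = None


def prepare_forward_meta(model_inputs: dict, enable_mm: bool, eb5_runner: bool) -> ForwardMeta:
    """Only pass mask offsets through when they are actually maintained."""
    if enable_mm and not eb5_runner:
        # Normal multimodal path: update_attn_mask_offsets ran this step.
        return ForwardMeta(attn_mask_offsets=model_inputs["attn_mask_offsets"])
    # eb5_runner (or text-only) path: a stale buffer must never leak through.
    return ForwardMeta(attn_mask_offsets=None)
```

With this shape, skipping `update_attn_mask_offsets` cannot silently reuse the previous batch's offsets, because the backend sees `None` instead of the unrefreshed tensor.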
logger.info(f"seq_lens_this_time: {length + 1}")
logger.info(f"draft_tokens: {request.draft_token_ids}")
The new info-level logging prints every request's draft_tokens (a list of token ids derived from user-generated content) on the decode hot path. That is a clear privacy/data-leak risk, and under high concurrency it will flood the logs and hurt throughput. Consider removing this log, or at least downgrading it to debug and recording only a length/summary (e.g. token count or a hash), strictly gated by a switch (a deterministic logger or an env flag).
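A hedged sketch of the safer logging the comment asks for — debug level, env-flag gated, and summarizing rather than dumping token ids. The flag name `FD_DEBUG_SPEC_LOG` and the helper names are illustrative assumptions, not FastDeploy's API:

```python
# Log only a count and a short digest of the draft tokens, never the raw ids,
# and only when the operator opts in via an env flag (name is hypothetical).
import hashlib
import logging
import os

logger = logging.getLogger("spec_decode")


def draft_summary(draft_token_ids) -> str:
    """Summarize draft tokens without exposing the ids themselves."""
    digest = hashlib.sha256(repr(list(draft_token_ids)).encode()).hexdigest()[:8]
    return f"n={len(draft_token_ids)} sha={digest}"


def log_draft_tokens(request_id: str, draft_token_ids) -> None:
    if os.getenv("FD_DEBUG_SPEC_LOG", "0") != "1":
        return  # off by default; never logs on the hot path in production
    logger.debug("req=%s draft %s", request_id, draft_summary(draft_token_ids))
```

The digest still lets two log lines be compared for equality of the draft sequence without revealing its content.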
        len(request.output_token_ids) if prefill_end_index >= len(input_ids) else 0
    )
-   if self.enable_mm:
+   if self.enable_mm and not self.eb5_runner:
When eb5_runner=true this skips filling the multimodal attn_mask_offsets_full/decoder buffers, but _initialize_forward_meta() in this class still passes model_inputs["attn_mask_offsets"] into ForwardMeta whenever enable_mm=True (so the attention backend will use that tensor). If the offsets are never properly initialized/updated, the attention mask can easily be wrong and the output incorrect. In the eb5_runner case, also ensure ForwardMeta.attn_mask_offsets is None (or set enable_mm to False), or provide equivalent offsets computation/initialization, so stale/undefined data is never consumed.
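Since the `eb5_runner` flag comes from the EB5_ENABLE_FD_RUNNER environment variable named in this PR, it may help to show how such a switch is typically parsed. The helper name and the set of accepted truthy strings below are assumptions for illustration, not FastDeploy's actual parsing code:

```python
# Sketch of parsing a boolean runner switch from the environment.
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ("1", "true", "yes", "on") as True."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# The proposer would then gate the multimodal offsets path on this flag:
# eb5_runner = env_flag("EB5_ENABLE_FD_RUNNER")
```

Normalizing case and whitespace up front avoids the classic pitfall where `"True"` or `" 1 "` is silently treated as false.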
    def update_decoder_start_time(self):
        self.llm_engine_send_req_to_decoder_engine_timestamp = self.decode_inference_start_time
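To illustrate why the resource-manager change deep-copies metrics on the decode side: without `copy.deepcopy`, the prefill and decode paths would mutate the same RequestMetrics object. The fields in this sketch are simplified stand-ins for the real RequestMetrics:

```python
# Sketch: isolating decode-side metrics from the prefill-side object, then
# applying the new update_decoder_start_time method added in this PR.
import copy
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    arrival_time: float = 0.0
    decode_inference_start_time: float = 0.0
    llm_engine_send_req_to_decoder_engine_timestamp: float = 0.0

    def update_decoder_start_time(self) -> None:
        self.llm_engine_send_req_to_decoder_engine_timestamp = self.decode_inference_start_time


prefill_metrics = RequestMetrics(arrival_time=1.0)
# Decode side takes an isolated copy, so its timestamps never clobber prefill's.
decode_metrics = copy.deepcopy(prefill_metrics)
decode_metrics.decode_inference_start_time = 2.5
decode_metrics.update_decoder_start_time()
```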
The PR description still mostly retains the template content and does not cover the motivation, the concrete changes, how to use or regression-test them, or (if output is affected) accuracy-verification results. To ease review and future maintenance, please fill in at least Motivation/Modifications/Usage (or Command)/Accuracy Tests per the template (or state why no tests were added).
Codecov Report

❌ Patch coverage is

@@            Coverage Diff             @@
##             develop    #6995   +/-  ##
==========================================
  Coverage           ?   73.26%
==========================================
  Files              ?      399
  Lines              ?    56056
  Branches           ?     8851
==========================================
  Hits               ?    41071
  Misses             ?    12075
  Partials           ?     2910

Flags with carried forward coverage won't be shown.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- The PR title carries one of the tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.