[Speculative Decoding] fix pd-split metrics and support other model runner #6995
freeliuzc wants to merge 2 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR aims to improve metrics recording for Speculative Decoding (in particular MTP) in the PD-split scenario, and to support a specific model runner (gated by an environment-variable switch) so that EP/empty-input cases no longer misbehave.
Changes:
- Add logging and a fallback empty-input forward pass for MTP in a specific branch of GPUModelRunner.
- Introduce an environment-variable switch in the MTP proposer that controls the handling path for multimodal attn_mask_offsets.
- In the PD-split resource-management flow, fix/isolate the metrics object, and add a decoder-side inference start-timestamp field plus an update method.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Adds logging for the speculative decoding path; when there is no valid input/output, runs a fallback empty-input forward pass for MTP + EP |
| fastdeploy/spec_decode/mtp.py | Adds the EB5_ENABLE_FD_RUNNER environment-variable switch and conditionally skips building/updating the multimodal mask offsets in the CUDA path and when inserting tasks |
| fastdeploy/engine/sched/resource_manager_v1.py | Deep-copies metrics when the PD decode side receives a prefilled request and sets the decoder inference start time |
| fastdeploy/engine/request.py | Adds a decoder-engine send-timestamp field to RequestMetrics, plus a new update method that sets it |
Comments suppressed due to low confidence (1)
fastdeploy/spec_decode/mtp.py:904
- Same issue as above: when eb5_runner=true, _propose_cuda() skips update_attn_mask_offsets, but the forward pass still computes attention based on ForwardMeta.attn_mask_offsets whenever enable_mm=True. The offsets may then keep the previous batch's values, or remain uninitialized, and pollute the current step's inference. In the eb5_runner case, consider explicitly setting ForwardMeta.attn_mask_offsets to None (disabling multimodal mask offsets), or guarantee that the offsets are correctly refreshed on every step.
if self.enable_mm and not self.eb5_runner:
attn_mask_offsets = update_attn_mask_offsets(
ids_remove_padding,
getattr(
self.model_inputs, "seq_lens_this_time", self.model_inputs["seq_lens_this_time_buffer"]
),
self.model_inputs["seq_lens_encoder"],
self.model_inputs["seq_lens_decoder"],
cu_seqlens_q,
self.model_inputs["attn_mask_offsets_full"],
self.model_inputs["attn_mask_offsets_decoder"],
self.model_inputs["is_block_step"],
self.model_inputs["decode_states"],
self.model_inputs["mask_rollback"],
)
self.model_inputs["attn_mask_offsets"].copy_(attn_mask_offsets, False)
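The reviewer's suggestion can be sketched as a small gate when building the forward metadata. This is a minimal illustration, not FastDeploy's actual code: the `ForwardMeta` dataclass here is a simplified stand-in (the real one holds GPU tensors), and `prepare_forward_meta` is a hypothetical helper.

```python
# Sketch of the suggested fix: when the eb5_runner switch is on, never hand
# the attention backend a mask-offsets buffer that was not refreshed this
# step. ForwardMeta and prepare_forward_meta are illustrative stand-ins.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForwardMeta:
    # In the real code this is a GPU tensor; a list stands in here.
    attn_mask_offsets: Optional[List[int]] = None


def prepare_forward_meta(model_inputs: dict, enable_mm: bool, eb5_runner: bool) -> ForwardMeta:
    """Only pass mask offsets through when they are actually maintained."""
    if enable_mm and not eb5_runner:
        # Normal multimodal path: update_attn_mask_offsets ran this step.
        return ForwardMeta(attn_mask_offsets=model_inputs["attn_mask_offsets"])
    # eb5_runner (or text-only) path: a stale buffer must never leak through.
    return ForwardMeta(attn_mask_offsets=None)
```

With this shape, skipping `update_attn_mask_offsets` cannot silently reuse the previous batch's offsets, because the backend sees `None` instead of the unrefreshed tensor.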
logger.info(f"seq_lens_this_time: {length + 1}")
logger.info(f"draft_tokens: {request.draft_token_ids}")
The new info-level logging prints every request's draft_tokens (a list of token ids derived from user-generated content) on the decode hot path. That is a clear privacy/data-leak risk, and under high concurrency it will flood the logs and hurt throughput. Consider removing this log, or at least downgrading it to debug and recording only a length/summary (e.g. token count or a hash), strictly gated by a switch (a deterministic logger or an env flag).
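A hedged sketch of the safer logging the comment asks for — debug level, env-flag gated, and summarizing rather than dumping token ids. The flag name `FD_DEBUG_SPEC_LOG` and the helper names are illustrative assumptions, not FastDeploy's API:

```python
# Log only a count and a short digest of the draft tokens, never the raw ids,
# and only when the operator opts in via an env flag (name is hypothetical).
import hashlib
import logging
import os

logger = logging.getLogger("spec_decode")


def draft_summary(draft_token_ids) -> str:
    """Summarize draft tokens without exposing the ids themselves."""
    digest = hashlib.sha256(repr(list(draft_token_ids)).encode()).hexdigest()[:8]
    return f"n={len(draft_token_ids)} sha={digest}"


def log_draft_tokens(request_id: str, draft_token_ids) -> None:
    if os.getenv("FD_DEBUG_SPEC_LOG", "0") != "1":
        return  # off by default; never logs on the hot path in production
    logger.debug("req=%s draft %s", request_id, draft_summary(draft_token_ids))
```

The digest still lets two log lines be compared for equality of the draft sequence without revealing its content.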
        len(request.output_token_ids) if prefill_end_index >= len(input_ids) else 0
    )
-   if self.enable_mm:
+   if self.enable_mm and not self.eb5_runner:
When eb5_runner=true this skips filling the multimodal attn_mask_offsets_full/decoder buffers, but _initialize_forward_meta() in this class still passes model_inputs["attn_mask_offsets"] into ForwardMeta whenever enable_mm=True (so the attention backend will use that tensor). If the offsets are never properly initialized/updated, the attention mask can easily be wrong and the output incorrect. In the eb5_runner case, also ensure ForwardMeta.attn_mask_offsets is None (or set enable_mm to False), or provide equivalent offsets computation/initialization, so stale/undefined data is never consumed.
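Since the `eb5_runner` flag comes from the EB5_ENABLE_FD_RUNNER environment variable named in this PR, it may help to show how such a switch is typically parsed. The helper name and the set of accepted truthy strings below are assumptions for illustration, not FastDeploy's actual parsing code:

```python
# Sketch of parsing a boolean runner switch from the environment.
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ("1", "true", "yes", "on") as True."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# The proposer would then gate the multimodal offsets path on this flag:
# eb5_runner = env_flag("EB5_ENABLE_FD_RUNNER")
```

Normalizing case and whitespace up front avoids the classic pitfall where `"True"` or `" 1 "` is silently treated as false.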
    def update_decoder_start_time(self):
        self.llm_engine_send_req_to_decoder_engine_timestamp = self.decode_inference_start_time
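To illustrate why the resource-manager change deep-copies metrics on the decode side: without `copy.deepcopy`, the prefill and decode paths would mutate the same RequestMetrics object. The fields in this sketch are simplified stand-ins for the real RequestMetrics:

```python
# Sketch: isolating decode-side metrics from the prefill-side object, then
# applying the new update_decoder_start_time method added in this PR.
import copy
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    arrival_time: float = 0.0
    decode_inference_start_time: float = 0.0
    llm_engine_send_req_to_decoder_engine_timestamp: float = 0.0

    def update_decoder_start_time(self) -> None:
        self.llm_engine_send_req_to_decoder_engine_timestamp = self.decode_inference_start_time


prefill_metrics = RequestMetrics(arrival_time=1.0)
# Decode side takes an isolated copy, so its timestamps never clobber prefill's.
decode_metrics = copy.deepcopy(prefill_metrics)
decode_metrics.decode_inference_start_time = 2.5
decode_metrics.update_decoder_start_time()
```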
The PR description still mostly retains the template content and does not cover the motivation, the concrete changes, how to use or regression-test them, or (if output is affected) accuracy-verification results. To ease review and future maintenance, please fill in at least Motivation/Modifications/Usage (or Command)/Accuracy Tests per the template (or state why no tests were added).
Codecov Report

❌ Patch coverage is

@@            Coverage Diff             @@
##             develop    #6995   +/-  ##
==========================================
  Coverage           ?   73.26%
==========================================
  Files              ?      399
  Lines              ?    56056
  Branches           ?     8851
==========================================
  Hits               ?    41071
  Misses             ?    12075
  Partials           ?     2910

Flags with carried forward coverage won't be shown.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- The PR title carries one of the tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.