[BugFix][KVCache][Speculative Decoding] Fix get_max_chunk_tokens for PD-split decode node in MTP scenario #7756
Conversation
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-09 13:39:36
📋 Review Summary
PR overview: fixes a bug where, in the PD-disaggregated + MTP scenario, the decode (D) node's `get_max_chunk_tokens()` did not multiply by the speculative-decoding token multiplier, leading to insufficient KV cache allocation.
Scope of change: fastdeploy/config.py
Impact tags: [BugFix] [KVCache] [Speculative Decoding] [PD Disaggregation]
📝 PR Convention Check
Two convention issues: ① the title contains multiple official tags (the convention requires exactly one); ② the PR description is missing the required ## Accuracy Tests section.
Suggested title (copy-paste ready):
[BugFix] Fix get_max_chunk_tokens for PD-split decode node in MTP scenario
Suggested PR description (copy-paste ready, with all required sections filled in):
## Motivation
In the PD-disaggregated + MTP (Multi-Token Prediction) scenario, the return value of `get_max_chunk_tokens()` on the decode (D) node is computed incorrectly.
The current code returns `max_num_seqs` directly for the D node, ignoring that in the MTP scenario each sequence processes `num_speculative_tokens + 1` tokens per decode step. This leads to two problems (a worked example follows the list):
1. **Under-provisioned KV cache planning in `cache_config.postprocess`**: `get_max_chunk_tokens()` feeds the cache-management strategy, so an undersized return value causes block-allocation failures or OOM on the D node during MTP inference.
2. **Underestimated memory in `profile_run`**: the profile run uses this value to choose the number of dummy input tokens, so underestimating it makes the peak-memory estimate inaccurate.
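For concreteness, a quick worked example with hypothetical numbers (not taken from this PR):

```python
# Hypothetical numbers for illustration only.
max_num_seqs = 100          # sequences the D node serves per decode step
num_speculative_tokens = 3  # MTP draft tokens per step

old_estimate = max_num_seqs                            # 100 tokens planned
actual = max_num_seqs * (num_speculative_tokens + 1)   # 400 tokens processed
print(f"planned for {old_estimate} tokens, actually need {actual}")
```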
## Modifications
In the `get_max_chunk_tokens()` method in `fastdeploy/config.py`, D-node non-XPU branch:
- Before: `num_tokens = self.scheduler_config.max_num_seqs`
- After: `num_tokens = self.scheduler_config.max_num_seqs * mtp_steps`
where `mtp_steps = num_speculative_tokens + 1` (1 when MTP is off, so fully backward compatible), matching the `tokens_per_seq` computation in `_check_max_num_batched_tokens` in the same file. A sketch of the fixed branch follows.
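A minimal sketch of the fixed branch, rewritten as a standalone function for illustration; the real method in `fastdeploy/config.py` has additional branches (prefill node, XPU path) that are omitted here, and any structure beyond the snippets quoted above is assumed:

```python
def get_max_chunk_tokens_decode(scheduler_config, speculative_config):
    """Sketch of the D-node (non-XPU) branch after the fix."""
    mtp_steps = 1
    # With speculative decoding enabled, each sequence processes
    # (num_speculative_tokens + 1) tokens per decode step.
    if speculative_config is not None and speculative_config.method is not None:
        mtp_steps = speculative_config.num_speculative_tokens + 1
    # Before the fix this returned max_num_seqs alone, underestimating
    # the per-step token count whenever mtp_steps > 1.
    return scheduler_config.max_num_seqs * mtp_steps
```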
## Usage or Command
Example configuration for launching a D node with MTP enabled in the PD-disaggregated setup:
```bash
# Launch the D node with MTP inference enabled
python -m fastdeploy.entrypoints.openai.api_server \
--model /path/to/model \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--splitwise-config '{"role": "decode"}' \
--max-num-seqs 100 \
...
```
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 📝 PR convention | — | Title contains multiple tags; description missing the ## Accuracy Tests section |
| ❓ Question | fastdeploy/config.py:2513 | `method is not None` covers all speculative-decoding methods, inconsistent with the comment "In MTP scenario" |
❓ Question: config.py condition is inconsistent with its comment
The current condition `self.speculative_config.method is not None` covers all speculative-decoding methods (ngram, suffix, etc.), while the comment above it says `# In MTP scenario`; the two are semantically inconsistent. Please confirm which is intended (a sketch contrasting both options follows this list):
- If all speculative-decoding methods need the `num_speculative_tokens + 1` multiplier (i.e., the D node processes draft tokens plus one verification token per step), change the comment to a generic description: `# In speculative decoding scenario, each sequence processes (num_speculative_tokens + 1) tokens per decode step (draft tokens + 1 verification token)`
- If this fix applies only to the MTP method, tighten the condition to: `if self.speculative_config is not None and getattr(self.speculative_config, "method", None) == "mtp"`
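A hedged sketch contrasting the two options, using stand-in config objects rather than the project's real classes (all names here are illustrative):

```python
from types import SimpleNamespace

def mtp_steps_broad(spec_cfg):
    """Option 1: broad condition; any speculative method gets the multiplier."""
    if spec_cfg is not None and spec_cfg.method is not None:
        return spec_cfg.num_speculative_tokens + 1
    return 1

def mtp_steps_mtp_only(spec_cfg):
    """Option 2: tightened condition; only the MTP method gets the multiplier."""
    if spec_cfg is not None and getattr(spec_cfg, "method", None) == "mtp":
        return spec_cfg.num_speculative_tokens + 1
    return 1

# An "ngram" config shows where the two options diverge.
ngram = SimpleNamespace(method="ngram", num_speculative_tokens=3)
print(mtp_steps_broad(ngram))     # 4: multiplier applied
print(mtp_steps_mtp_only(ngram))  # 1: multiplier skipped
```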
Overall Assessment
The fix is in the right direction and backward compatible (with num_speculative_tokens=0 it degenerates to the original logic). The author should confirm whether the `method is not None` condition ought to be restricted to MTP only, and add the missing ## Accuracy Tests section to the PR description.
CI report generated from the code below (updated every 30 minutes):
1 Task overview: ⏳ CI in progress; 6/8 required tasks passed, 2 still running, no required failures so far.
2 Task status summary
2.1 Required tasks: 6/8 passed
2.2 Optional tasks: 18/23 passed
3 Failure details (required only): none
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7756 +/- ##
==========================================
Coverage ? 72.20%
==========================================
Files ? 396
Lines ? 55595
Branches ? 8691
==========================================
Hits ? 40141
Misses ? 12690
Partials ? 2764
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
✅ Cherry-pick successful! Created PR: #7758