[KVCache][BugFix] fix cache_manager_v1 allocating kv cache with wrong dtype when kv_cache_quant_type is set#7757
Conversation
… dtype when kv_cache_quant_type is set
When `enable_cache_manager_v1=True` and `kv_cache_quant_type` is configured (e.g., int8), cache_controller.v1 was allocating KV cache tensors using the model compute dtype (bfloat16) instead of uint8. This caused a C++ dtype mismatch crash in `append_attention_gpu`, because the attention kernel accesses int8/fp8 quantized caches as `uint8_t*` internally. Fix: use "uint8" as the cache allocation dtype whenever `kv_cache_quant_type` is not None, consistent with how `gpu_model_runner` handles this in the non-v1 code path. Affected: `initialize_kv_cache()` and `initialize_mtp_kv_cache()` in CacheController.
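The dtype selection described above can be sketched as a small helper; the names `kv_cache_quant_type` and the model dtype come from the PR description, while the helper function itself is hypothetical, not the actual CacheController API:

```python
def select_cache_dtype(kv_cache_quant_type, model_dtype):
    """Pick the allocation dtype for KV cache tensors.

    Quantized caches (int8/fp8) are stored as raw bytes and the
    attention kernel reads them as uint8_t*, so allocating them with
    the model compute dtype (e.g. bfloat16) causes the C++ dtype
    mismatch crash in append_attention_gpu described above.
    """
    if kv_cache_quant_type is not None:
        return "uint8"   # all quant types use 1-byte storage
    return model_dtype   # e.g. "bfloat16"
```

For example, `select_cache_dtype("int8", "bfloat16")` yields `"uint8"`, while the unquantized path keeps `"bfloat16"`.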
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-09 14:17:05
📋 Review Summary
PR overview: fixes a bug where the v1 CacheController allocated GPU KV cache tensors with the wrong dtype (bfloat16 instead of uint8) when KV cache quantization is enabled
Scope of change: fastdeploy/cache_manager/v1/cache_controller.py
Impact tag: [KVCache]
📝 PR Convention Check
Two convention issues: ① the title carries two tags ([KVCache][BugFix]), while the convention allows only one official tag; ② the PR description does not follow the standard template (missing required sections such as ## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, and ## Checklist).
Suggested title (copy-paste ready):
[BugFix] fix cache_manager_v1 allocating kv cache with wrong dtype when kv_cache_quant_type is set
Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):
## Motivation
When `enable_cache_manager_v1=True` and KV cache quantization is configured (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (e.g. `bfloat16`) instead of the `uint8` required by quantization, causing the C++ operator `append_attention_gpu` to crash with a dtype mismatch: `Expected dtype() == phi::CppTypeToDataType<T>::Type(), but received dtype():16 != dtype():2`.
## Modifications
- `fastdeploy/cache_manager/v1/cache_controller.py`: in `initialize_kv_cache` and `initialize_mtp_kv_cache`, determine `cache_dtype` from `kv_cache_quant_type` before allocating GPU KV cache tensors (`"uint8"` when quantization is configured, otherwise `self.model_config.dtype`), consistent with the non-v1 path in `gpu_model_runner.py`.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | cache_manager/v1/cache_controller.py | The fix ships without unit tests; consider adding a test case under tests/cache_manager/ covering quantized-dtype allocation |
Overall assessment
The fix logic is correct and consistent with the non-v1 path; both initialize_kv_cache and initialize_mtp_kv_cache were patched in sync, and the change is concise and clear. Recommend adding unit tests and bringing the PR description up to convention before merging.
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview
⏳ CI in progress — 0 Required tasks failed, 4 Required tasks still running; please wait for the results.
2 Task status summary
2.1 Required tasks: 6/10 passed
2.2 Optional tasks: 21/25 passed
3 Failure details (required only)
No failed required tasks.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7757 +/- ##
==========================================
Coverage ? 71.59%
==========================================
Files ? 396
Lines ? 55596
Branches ? 8691
==========================================
Hits ? 39804
Misses ? 13052
Partials ? 2740
Flags with carried forward coverage won't be shown.
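The line totals in the coverage diff above are self-consistent; a quick check with the numbers copied from the report:

```python
# Totals from the Codecov diff: Hits + Misses + Partials must equal Lines.
hits, misses, partials = 39804, 13052, 2740
lines = 55596
assert hits + misses + partials == lines

# Coverage percentage: hits / lines, matching the reported 71.59%.
coverage = hits / lines * 100
assert abs(coverage - 71.59) < 0.01
```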
Problem description
When `enable_cache_manager_v1=True` and KV cache quantization is configured (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (`bfloat16`) instead of the `uint8` required by quantization. This crashes the C++ operator `append_attention_gpu`.
Root cause: the attention layer's `cache_quant_type_str` is correctly set to `"cache_int8"` and the C++ kernel accesses the cache as `uint8_t*`, but the v1 cache controller allocates the cache as `bfloat16`, producing a dtype mismatch.
Fix
Before allocating GPU cache tensors, determine the correct `cache_dtype` from `kv_cache_quant_type`:
- quantization configured: `"uint8"` (all quant types use 1-byte storage)
- no quantization: `model_config.dtype` (e.g. bfloat16)
This is consistent with the non-v1 code path in `gpu_model_runner.py`.
Affected scope
- `CacheController.initialize_kv_cache()`
- `CacheController.initialize_mtp_kv_cache()`
- `initialize_host_cache()` (already correctly uses `cache_config.cache_dtype`)
Reproduction conditions
Start the service with `enable_cache_manager_v1=True` and `kv_cache_quant_type` set to any quantization type (int8/fp8, etc.).
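The reproduction conditions above map directly onto a table-driven check; the selection logic is inlined here and the function name is illustrative, not the actual CacheController API:

```python
def cache_alloc_dtype(kv_cache_quant_type, model_dtype):
    # Mirrors the fixed behavior: any quant type means 1-byte storage,
    # so the cache tensor must be allocated as uint8.
    return "uint8" if kv_cache_quant_type is not None else model_dtype

# Every quantized configuration must allocate as uint8; only the
# unquantized path keeps the model compute dtype.
cases = [
    ("int8", "bfloat16", "uint8"),
    ("fp8", "bfloat16", "uint8"),
    (None, "bfloat16", "bfloat16"),
    (None, "float16", "float16"),
]
for quant, model_dtype, expected in cases:
    assert cache_alloc_dtype(quant, model_dtype) == expected
```

A unit test along these lines under tests/cache_manager/ would cover the gap the reviewer flagged, though testing the real `initialize_kv_cache` would require a GPU fixture.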