[KVCache][BugFix] fix cache_manager_v1 allocating kv cache with wrong dtype when kv_cache_quant_type is set#7757
Conversation
… dtype when kv_cache_quant_type is set
When `enable_cache_manager_v1=True` and `kv_cache_quant_type` is configured (e.g., int8), cache_controller.v1 was allocating KV cache tensors using the model compute dtype (bfloat16) instead of uint8. This caused a C++ dtype mismatch crash in `append_attention_gpu`, because the attention kernel accesses int8/fp8 quantized caches as `uint8_t*` internally. Fix: use "uint8" as the cache allocation dtype whenever `kv_cache_quant_type` is not None, consistent with how `gpu_model_runner` handles this in the non-v1 code path. Affected: `initialize_kv_cache()` and `initialize_mtp_kv_cache()` in CacheController.
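The dtype selection described above can be sketched as a small helper; the names `kv_cache_quant_type` and the model dtype come from the PR description, while the helper function itself is hypothetical, not the actual CacheController API:

```python
def select_cache_dtype(kv_cache_quant_type, model_dtype):
    """Pick the allocation dtype for KV cache tensors.

    Quantized caches (int8/fp8) are stored as raw bytes and the
    attention kernel reads them as uint8_t*, so allocating them with
    the model compute dtype (e.g. bfloat16) causes the C++ dtype
    mismatch crash in append_attention_gpu described above.
    """
    if kv_cache_quant_type is not None:
        return "uint8"   # all quant types use 1-byte storage
    return model_dtype   # e.g. "bfloat16"
```

For example, `select_cache_dtype("int8", "bfloat16")` yields `"uint8"`, while the unquantized path keeps `"bfloat16"`.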
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-09 14:17:05
📋 Review Summary
PR overview: fixes a bug where the v1 CacheController allocated GPU KV cache tensors with the wrong dtype (bfloat16 instead of uint8) when KV cache quantization is enabled
Scope of change: fastdeploy/cache_manager/v1/cache_controller.py
Impact tag: [KVCache]
📝 PR Convention Check
Two convention issues: ① the title carries two tags ([KVCache][BugFix]), while the convention allows only one official tag; ② the PR description does not follow the standard template (missing required sections such as ## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, and ## Checklist).
Suggested title (copy-paste ready):
[BugFix] fix cache_manager_v1 allocating kv cache with wrong dtype when kv_cache_quant_type is set
Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):
## Motivation
When `enable_cache_manager_v1=True` and KV cache quantization is configured (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (e.g. `bfloat16`) instead of the `uint8` required by quantization, causing the C++ operator `append_attention_gpu` to crash with a dtype mismatch: `Expected dtype() == phi::CppTypeToDataType<T>::Type(), but received dtype():16 != dtype():2`.
## Modifications
- `fastdeploy/cache_manager/v1/cache_controller.py`: in `initialize_kv_cache` and `initialize_mtp_kv_cache`, determine `cache_dtype` from `kv_cache_quant_type` before allocating GPU KV cache tensors (`"uint8"` when quantization is configured, otherwise `self.model_config.dtype`), consistent with the non-v1 path in `gpu_model_runner.py`.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | cache_manager/v1/cache_controller.py | The fix ships without unit tests; consider adding a test case under tests/cache_manager/ covering quantized-dtype allocation |
Overall assessment
The fix logic is correct and consistent with the non-v1 path; both initialize_kv_cache and initialize_mtp_kv_cache were patched in sync, and the change is concise and clear. Recommend adding unit tests and bringing the PR description up to convention before merging.
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview
⏳ CI in progress — 0 Required tasks failed, 4 Required tasks still running; please wait for the results.
2 Task status summary
2.1 Required tasks: 6/10 passed
2.2 Optional tasks: 21/25 passed
3 Failure details (required only)
No failed required tasks.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7757 +/- ##
==========================================
Coverage ? 71.59%
==========================================
Files ? 396
Lines ? 55596
Branches ? 8691
==========================================
Hits ? 39804
Misses ? 13052
Partials ? 2740
Flags with carried forward coverage won't be shown.
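The line totals in the coverage diff above are self-consistent; a quick check with the numbers copied from the report:

```python
# Totals from the Codecov diff: Hits + Misses + Partials must equal Lines.
hits, misses, partials = 39804, 13052, 2740
lines = 55596
assert hits + misses + partials == lines

# Coverage percentage: hits / lines, matching the reported 71.59%.
coverage = hits / lines * 100
assert abs(coverage - 71.59) < 0.01
```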
Problem description
When `enable_cache_manager_v1=True` and KV cache quantization is configured (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (`bfloat16`) instead of the `uint8` required by quantization. This crashes the C++ operator `append_attention_gpu`.
Root cause: the attention layer's `cache_quant_type_str` is correctly set to `"cache_int8"` and the C++ kernel accesses the cache as `uint8_t*`, but the v1 cache controller allocates the cache as `bfloat16`, producing a dtype mismatch.
Fix
Before allocating GPU cache tensors, determine the correct `cache_dtype` from `kv_cache_quant_type`:
- quantization configured: `"uint8"` (all quant types use 1-byte storage)
- no quantization: `model_config.dtype` (e.g. bfloat16)
This is consistent with the non-v1 code path in `gpu_model_runner.py`.
Affected scope
- `CacheController.initialize_kv_cache()`
- `CacheController.initialize_mtp_kv_cache()`
- `initialize_host_cache()` (already correctly uses `cache_config.cache_dtype`)
Reproduction conditions
Start the service with `enable_cache_manager_v1=True` and `kv_cache_quant_type` set to any quantization type (int8/fp8, etc.).
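The reproduction conditions above map directly onto a table-driven check; the selection logic is inlined here and the function name is illustrative, not the actual CacheController API:

```python
def cache_alloc_dtype(kv_cache_quant_type, model_dtype):
    # Mirrors the fixed behavior: any quant type means 1-byte storage,
    # so the cache tensor must be allocated as uint8.
    return "uint8" if kv_cache_quant_type is not None else model_dtype

# Every quantized configuration must allocate as uint8; only the
# unquantized path keeps the model compute dtype.
cases = [
    ("int8", "bfloat16", "uint8"),
    ("fp8", "bfloat16", "uint8"),
    (None, "bfloat16", "bfloat16"),
    (None, "float16", "float16"),
]
for quant, model_dtype, expected in cases:
    assert cache_alloc_dtype(quant, model_dtype) == expected
```

A unit test along these lines under tests/cache_manager/ would cover the gap the reviewer flagged, though testing the real `initialize_kv_cache` would require a GPU fixture.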