[XPU] Add swap_cache_layout op to support Mooncake KV cache for XPU. #7728
Jiajun-Ji wants to merge 2 commits into PaddlePaddle:develop
Thanks for your contribution!
Pull request overview
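This PR adds a custom swap_cache_layout operator on the XPU path. It converts layouts and copies data between the XPU KV cache (stored per layer) and a CPU pinned buffer (stored block-major, layer-minor), which lets Mooncake serve as the KV cache storage backend on XPU.
Changes:
- The Mooncake configuration now auto-detects and fills in RDMA NICs by default on both CUDA and XPU platforms.
- The XPU ops in cache_manager export swap_cache_layout for use by the storage read/write paths.
- Adds the XPU custom operator implementation of swap_cache_layout and its test cases.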
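The layout conversion described in the overview can be illustrated on the host with plain Python lists. This is only a sketch, not the XPU operator: the function name `swap_cache_layout_ref`, the flat-list representation, and the assumption that slot i of the CPU buffer holds the i-th requested block are all made up for the example. The `mode` convention (0 for XPU to CPU, 1 for CPU to XPU) follows the description given later in the review.

```python
from math import prod


def swap_cache_layout_ref(layer_caches, cpu_buffer, block_ids, cache_shape, mode):
    """Copy blocks between per-layer caches and a block-major, layer-minor buffer.

    layer_caches: one flat list per layer, logically shaped cache_shape
                  ([block_num, head_num, block_size, head_dim]).
    cpu_buffer:   flat list, logically [len(block_ids), layer_num, head_num,
                  block_size, head_dim] (hypothetical slot ordering).
    mode:         0 copies caches -> cpu_buffer, 1 copies cpu_buffer -> caches.
    """
    block_stride = prod(cache_shape[1:])  # elements in one block of one layer
    layer_num = len(layer_caches)
    for slot, block_id in enumerate(block_ids):
        for layer, cache in enumerate(layer_caches):
            dev = block_id * block_stride                     # offset inside the layer cache
            host = (slot * layer_num + layer) * block_stride  # offset inside the CPU buffer
            if mode == 0:
                cpu_buffer[host:host + block_stride] = cache[dev:dev + block_stride]
            elif mode == 1:
                cache[dev:dev + block_stride] = cpu_buffer[host:host + block_stride]
            else:
                raise ValueError(f"unsupported mode: {mode}")


# Tiny roundtrip with hypothetical sizes: 2 layers, 4 blocks of 6 elements each.
cache_shape = [4, 1, 2, 3]
caches = [[layer * 100 + i for i in range(prod(cache_shape))] for layer in range(2)]
block_ids = [3, 1]
cpu = [0] * (len(block_ids) * len(caches) * prod(cache_shape[1:]))

swap_cache_layout_ref(caches, cpu, block_ids, cache_shape, mode=0)
restored = [[0] * len(c) for c in caches]
swap_cache_layout_ref(restored, cpu, block_ids, cache_shape, mode=1)
```

After the roundtrip, `restored` matches `caches` on blocks 3 and 1 of every layer, while the blocks that were never requested stay zero.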
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py | Automatic RDMA device selection is now also supported on XPU |
| fastdeploy/cache_manager/ops.py | Imports and exposes swap_cache_layout on the XPU platform |
| custom_ops/xpu_ops/src/ops/swap_cache_layout.cc | Adds the XPU swap_cache_layout operator (layout-converting copy between XPU and a CPU pinned buffer) |
| custom_ops/xpu_ops/test/test_swap_cache_layout.py | Adds roundtrip and performance test cases for swap_cache_layout |
auto* cache_cpu_ptr = reinterpret_cast<T*>(cache_cpu_pointer);

for (int block_idx = 0; block_idx < static_cast<int>(xpu_block_ids.size());
     block_idx++) {
  auto cur_xpu_block_id = xpu_block_ids[block_idx];

for (int i = 1; i < static_cast<int>(cache_shape.size()); i++) {
  cache_block_stride *= cache_shape[i];
}

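The stride loop in the fragment above multiplies every dimension after the leading block dimension. A minimal Python sketch of the same arithmetic follows; the shape values and the 2-byte element size are hypothetical and chosen only for illustration.

```python
from math import prod


def block_stride(cache_shape):
    """Elements per block: product of all dimensions after the first one."""
    stride = 1
    for dim in cache_shape[1:]:
        stride *= dim
    return stride


# Hypothetical per-layer shape: [block_num, head_num, block_size, head_dim]
shape = [1024, 8, 64, 128]
stride = block_stride(shape)
same_as_prod = stride == prod(shape[1:])  # the loop is just a product
offset_of_block_5 = 5 * stride            # element offset of block 5 in one layer's cache
bytes_per_block = stride * 2              # assuming a 2-byte element type
```

With this shape one block holds 65536 elements, so each per-block copy moves 128 KiB under the assumed element size.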
    mode,
)
paddle.device.synchronize()
cost_time = time.time() - start
print(
    f"swap cache layout ({label}), total_gb: {total_gb:.6f}GB, "
    f"cost_time: {cost_time:.6f}s, speed: {total_gb / cost_time:.6f}GB/s"
)

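The timing pattern in the fragment above (run the copy, synchronize, read the clock, divide the volume by the elapsed time) can be tried on the host without any device. The sketch below is an assumption-laden stand-in: `measure_bandwidth` and `host_copy` are hypothetical names, the copy is a plain host-to-host `bytearray` copy, and 1 GB is taken as 1e9 bytes, which may differ from how the test computes `total_gb`.

```python
import time


def measure_bandwidth(copy_fn, total_bytes, repeats=3):
    """Time copy_fn several times; return (best_seconds, gb_per_s), 1 GB = 1e9 bytes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_fn()  # a device copy would need to synchronize before returning
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero reading on a coarse timer
    return best, total_bytes / best / 1e9


src = bytes(32 * 1024 * 1024)  # 32 MiB of zeros as the source buffer
dst = bytearray(len(src))


def host_copy():
    dst[:] = src  # in-place host copy standing in for the device transfer


seconds, gb_per_s = measure_bandwidth(host_copy, len(src))
print(f"cost_time: {seconds:.6f}s, speed: {gb_per_s:.6f}GB/s")
```

Taking the best of several runs reduces the effect of one-off warm-up costs; the original test instead prints each run.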
def test_performance(self):
    for _ in range(3):
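CI report generated from the following code (updated every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 23/26 passed
3 Failure details (required only)
Approval: code conventions (custom op approval) (confidence: high)
Fix summary: one FastDeploy RD and one PaddlePaddle RD should each approve the PR. Related change: PR title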
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-07 11:13:41
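📋 Review summary
PR overview: adds a swap_cache_layout custom op for the XPU platform that converts layouts between the XPU KV cache and CPU pinned memory, and extends mooncake_store's automatic RDMA NIC detection to the XPU platform
Scope of changes: custom_ops/xpu_ops/src/ops/, fastdeploy/cache_manager/
Impact tags: [XPU] [KVCache] [OP]
📝 PR convention check
The ## Modifications section is empty (only the template placeholder comment remains), and the Checklist items have not been ticked to reflect the actual state.
Suggested title (can be copied directly):
[XPU][KVCache] Add swap_cache_layout op to support Mooncake KV cache for XPU
Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):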
## Motivation
Add the swap_cache_layout op to support Mooncake. It is called in _run_read_storage to perform the layout swap between the XPU KV cache and CPU pinned memory. Mooncake does not natively support XPU, so the Mooncake backend is connected through CPU memory space.
## Modifications
- Added `custom_ops/xpu_ops/src/ops/swap_cache_layout.cc`: implements the op that moves data between the XPU KV cache (layout: `[block_num, head_num, block_size, head_dim]`) and CPU pinned memory (layout: `[block_num, layer_num, head_num, block_size, head_dim]`), supporting both directions: mode=0 (XPU→CPU) and mode=1 (CPU→XPU)
- Added `custom_ops/xpu_ops/test/test_swap_cache_layout.py`: covers roundtrip correctness checks and XPU↔CPU bandwidth performance tests
- `fastdeploy/cache_manager/ops.py`: fixes `swap_cache_layout` being incorrectly set to `None` on the XPU platform; it is now imported from `fastdeploy.model_executor.ops.xpu`
- `fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py`: extends the RDMA NIC auto-detection logic (`get_rdma_nics()`) to the XPU platform
## Usage or Command
See the shell launch script in the PR description (starts the Mooncake Master and two XPU instances, then sends a verification request).
## Accuracy Tests
Instance 1 writes to Mooncake and instance 2 hits the Mooncake cache (screenshots are provided in the PR description).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/xpu_ops/src/ops/swap_cache_layout.cc:74 | xpu_memcpy is called synchronously, one call at a time, inside the layer × block double loop; large-model scenarios issue many serial XDMA calls |
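Overall assessment
The implementation approach is clear. It fixes the legacy issue where swap_cache_layout was incorrectly set to None on the XPU platform; the functionality is complete, and the roundtrip and performance tests provide adequate coverage. Only the ## Modifications section is empty; filling it in is recommended for traceability.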
void* src = (mode == 0) ? static_cast<void*>(xpu_ptr_now)
                        : static_cast<void*>(cpu_ptr_now);

int ret = xpu_memcpy(dst, src, cache_block_stride * sizeof(T), copy_kind);
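🟡 Suggestion: xpu_memcpy is called synchronously, one call at a time, inside the layer_num × block_num double loop.
For large models (32+ layers, many blocks), this produces a large number of serial XDMA calls and may become a throughput bottleneck. Consider evaluating whether the XPU runtime supports streamed or asynchronous memcpy with batched submission, or submitting the transfers for multiple blocks within the same layer as a batch to increase concurrency.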
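To get a feel for the scale of the concern, the number of per-block copies and their size can be worked out directly. The sketch below is a back-of-the-envelope estimate under assumed numbers: `copy_plan` is a hypothetical helper, and the layer count, block count, block shape, and 2-byte element size are illustrative values, not measurements from this PR.

```python
def copy_plan(layer_num, num_blocks, block_stride, itemsize):
    """Return (number of copies, bytes per copy, total bytes) for one swap."""
    calls = layer_num * num_blocks  # one copy per (layer, block) pair
    bytes_per_call = block_stride * itemsize
    return calls, bytes_per_call, calls * bytes_per_call


# Hypothetical case: 32 layers, 256 blocks, blocks of 8 x 64 x 128 two-byte elements.
calls, per_call, total = copy_plan(
    layer_num=32, num_blocks=256, block_stride=8 * 64 * 128, itemsize=2
)
print(f"{calls} copies of {per_call} bytes, {total / 2**30:.2f} GiB in total")
```

Under these assumptions a single swap issues 8192 copies of 128 KiB each, so any fixed per-call cost is paid thousands of times; whether that dominates depends on the runtime and is not established by this estimate.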
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7728 +/- ##
==========================================
Coverage ? 71.60%
==========================================
Files ? 396
Lines ? 55568
Branches ? 8688
==========================================
Hits ? 39791
Misses ? 13039
Partials ? 2738
Motivation
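Add the swap_cache_layout op to support Mooncake; it is called in _run_read_storage to swap the XPU KV cache into CPU pinned memory.
Mooncake does not natively support XPU, so the Mooncake backend is connected through CPU memory space.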
Modifications
Usage or Command
Accuracy Tests
config


cache_manager.log
Instance 1 writes to Mooncake

Instance 2 receives from Mooncake

Checklist
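- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.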