
[not merge] xpu glm test #7748

Open
zhupengyang wants to merge 5 commits into PaddlePaddle:develop from zhupengyang:glm_docker_1_merge

Conversation

@zhupengyang
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 8, 2026

Thanks for your contribution!


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-08 11:06:22

📋 Review Summary

PR overview: fixes related to XPU GLM testing, covering KV Cache transfer flow adjustments, more robust prefix cache management, sampler adaptation for XPU, a local scheduler bug fix, and extended PD disaggregation control commands.

Change scope: cache_manager/, model_executor/layers/sample/, scheduler/, splitwise/

Impact tags: [KVCache] [Scheduler] [XPU] [PD Disaggregation] [OP]

📝 PR Convention Check

The PR title [not merge] xpu glm test lacks an official tag, and [not merge] is not in the approved tag list; all description sections are empty placeholders with no content filled in.

Suggested title (copy-paste ready):

  • [XPU][BugFix] Fix XPU GLM sampling and KVCache transfer issues

Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):

## Motivation
Fix the sampler random-seed range issue for GLM model inference on XPU (XPU only supports the 32-bit integer range), the index-flush ordering issue in the KV Cache storage transfer flow, and the exception raised by the prefix cache manager when GPU blocks run short; also fix a cursor-update bug in the local scheduler's `_recycle` and add support for the `interrupt_requests` control command in PD disaggregation.

## Modifications
- `sampler.py`: use `MAX_INFER_SEED=2147483646` (the 32-bit upper bound) on XPU; other platforms keep the original value
- `sampler.py`: change the `offsets` computation from `local_pos * 4` to `local_pos * 32`
- `cache_transfer_manager.py`: move `flush_token_index` from the `finally` block of `write_back_storage_task` to before the data write in `_run_write_back_storage`
- `prefix_cache_manager.py`: log a warning and skip instead of raising when GPU blocks are insufficient; guard storage prefetch with try/except; remove the `enable_output_caching` conditional so output tokens are always included; truncate token_ids to the number of blocks
- `attention_store.py`: change the write operation to sliced writes with timeout control, configurable via environment variables (`AS_WRITE_TOTAL_TIMEOUT`, `AS_WRITE_SLICE_BLOCK_NUM`, `AS_WRITE_SLICE_TIMEOUT`)
- `local_scheduler.py`: fix `_recycle` so `ids_read_cursor` is decremented only when the removed index is below the cursor; fix an index-misalignment bug when cleaning up batches of expired IDs
- `splitwise/internal_adapter_utils.py`: add an `interrupt_requests` control command that calls `add_abort_req_ids` to abort the specified requests

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:103 | `local_pos * 32` is not gated to the XPU platform and will affect the sampling offset computation on all platforms |
| 🔴 Bug | fastdeploy/cache_manager/cache_transfer_manager.py:935 | `flush_token_index` now runs before the data write, risking the index landing before the data write completes |
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:1143 | with the `enable_output_caching` check removed, output tokens are still written when the flag is False, changing the original semantics |
| 🟡 Suggestion | fastdeploy/scheduler/local_scheduler.py | A4: `_recycle` is fixed; please confirm whether `global_scheduler`, `dp_scheduler`, and `splitwise_scheduler` have the same cursor-update bug |
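The cursor-update fix flagged for the other schedulers can be illustrated with a minimal sketch (hypothetical class and method names; the real `LocalScheduler` keeps more state): deleting an element shifts every later index left by one, so the read cursor must move only when the removed index lies before it, and batch deletions must walk indices from high to low to avoid index misalignment.

```python
class CursorList:
    """Hypothetical sketch of the cursor-update fix described in the review."""

    def __init__(self, ids):
        self.ids = list(ids)
        self.ids_read_cursor = 0  # how many entries have been read so far

    def recycle(self, request_id):
        index = self.ids.index(request_id)
        del self.ids[index]
        # The fix: only indices before the cursor shift it left.
        if index < self.ids_read_cursor:
            self.ids_read_cursor -= 1

    def recycle_many(self, expired):
        # Resolve all indices first, then delete from highest to lowest so
        # earlier indices stay valid while we mutate the list.
        for index in sorted((self.ids.index(r) for r in expired), reverse=True):
            del self.ids[index]
            if index < self.ids_read_cursor:
                self.ids_read_cursor -= 1
```

Without the `index < self.ids_read_cursor` guard, recycling an unread entry past the cursor would wrongly rewind it, causing already-read IDs to be served again.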

Overall Assessment

This PR contains several valuable fixes (the scheduler cursor bug, storage-prefetch exception handling, sliced writes in attention_store), but it has 3 P0 issues affecting correctness on general platforms that must be confirmed or fixed by the author before merging. In addition, the PR is marked [not merge]; before a formal merge the description must be completed and full XPU/GPU regression tests must pass.

```python
offsets = paddle.where(
    is_decoder,
    local_pos * 4,
    local_pos * 32,
```

🔴 Bug: the `local_pos * 32` change is not gated to the XPU platform, so it affects the sampling offset computation on all platforms, including CUDA/GPU.

In the same function, the `MAX_INFER_SEED` change is correctly gated with `if current_platform.is_xpu()`, but that protection is missing here.

If `* 32` is an XPU-only adaptation, consider changing it to:

```python
offsets = paddle.where(
    is_decoder,
    local_pos * 32 if current_platform.is_xpu() else local_pos * 4,
    paddle.zeros_like(local_pos),
)
```

If the change is valid for all platforms, please explain why in the PR description.

```python
elif self.storage_backend_type == "attention_store":
    try:
        if (self.rank == 0) and self.storage_backend_type == "attention_store":
            self.storage_backend.flush_token_index(task_id, token_ids, 0, False)
```

🔴 Bug: `flush_token_index` now executes before the data write, creating a timing window in which the index is visible before the actual data is available.

Originally this call lived in the `finally` block of `write_back_storage_task`, guaranteeing that the index was flushed only after the entire write flow had completed, whether it succeeded or failed. Moving it before the write means that if the subsequent `sdk.write` fails, the index in storage is updated while the data is never written, so other requests can hit the index yet fail to read valid KV Cache.

Please confirm whether this change is an XPU-specific requirement; if so, consider adding a platform or backend check:

```python
if not current_platform.is_xpu():
    # flush after write (original behavior)
```

Otherwise, please explain the intended design of the new ordering in the PR.
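For reference, the original write-then-flush ordering described above can be sketched as follows (hypothetical function and backend names; the real transfer manager's flow is more involved): the index flush sits in a `finally` block after the data write, so the index can never be observed before the write flow has finished.

```python
def write_back_storage_task(backend, task_id, token_ids, data):
    """Sketch of the original ordering: data write first, index flush last."""
    try:
        backend.write(task_id, data)  # may raise; index is not yet flushed
    finally:
        # The index is flushed only after the write flow has finished,
        # whether it succeeded or failed, never before the data write runs.
        backend.flush_token_index(task_id, token_ids, 0, False)
```

Reversing this order is exactly the risk the review flags: the index becomes readable while the write may still fail.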

```python
    input_token_ids = token_ids + request.output_token_ids
else:
    input_token_ids = token_ids
input_token_ids = token_ids + request.output_token_ids
```

🔴 Bug: with the `enable_output_caching` check removed, `output_token_ids` is always concatenated onto `token_ids` for the storage write, regardless of whether the flag is set.

With `enable_output_caching=False`, only the KV Cache for input tokens used to be written; now output tokens are included as well, so the written data no longer matches what a prefix-cache hit expects, potentially causing cache pollution or a `token_ids` length overrun.

A later line adds the truncation `input_token_ids = input_token_ids[: len(keys) * block_size]`, but that only prevents the overrun; it does not restore the original semantic distinction. Please confirm whether this change is intended.
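A minimal sketch of the semantics the comment says were lost (hypothetical helper name and parameters; the real code operates on a `request` object): the output tokens are gated on `enable_output_caching`, and the block-aligned truncation remains as a length guard on top.

```python
def build_input_token_ids(token_ids, output_token_ids, enable_output_caching,
                          num_blocks, block_size):
    """Sketch: gate output tokens on the flag, then truncate to whole blocks."""
    if enable_output_caching:
        input_token_ids = token_ids + output_token_ids
    else:
        # Original semantics: only input tokens participate in the write.
        input_token_ids = token_ids
    # The truncation guards against writing past the allocated blocks.
    return input_token_ids[: num_blocks * block_size]
```

With the gate restored, the truncation and the flag each do their own job instead of the truncation papering over both.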

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-08 11:14:59

The CI report is generated from the following code (updated every 30 minutes):


1 Task Overview

All Required tasks passed (this PR has no Required tasks); 1 optional task failed, which does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 1 (0) | 1 | 0 | 1 | 0 | 0 | 0 |

2 Task Status Summary

2.1 Required tasks: 0/0 passed

This PR has no Required tasks configured (no mandatory tasks in the Branch Protection Rules); merging is not blocked.

2.2 Optional tasks — 0/1 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Trigger Jenkins for PR | 50s | Job | - |

3 Failure Details (required only)

No failed required tasks.
