
[BugFix][KSM] Fix sampling_mask reordering in recover_batch_index_for…#7773

Merged
yuanlehome merged 1 commit into PaddlePaddle:release/2.6 from DesmonDay:release/2.6
May 11, 2026

Conversation

@DesmonDay
Contributor

Motivation

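When enable_keep_sampling_mask (KSM) and pd_reorder are both enabled, recover_batch_index_for_sampler_output restores the batch order of fields such as logprobs_tensors and sampled_token_ids but omits sampling_mask. As a result, each candidate vocabulary is paired with the wrong logprobs, the logZ normalization uses another request's candidate set, and abnormal logprob values are produced, which ultimately causes KL blow-up in downstream RL training (peaking at 64048).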
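To make the failure mode concrete, here is a minimal sketch of how a misaligned mask corrupts a renormalized logprob. The formula (logit minus the log-sum-exp over the kept candidates) and the name `masked_logprob` are assumptions for illustration only; the source does not show how FastDeploy computes logZ.

```python
import numpy as np


def masked_logprob(logits, mask, token_id):
    """Log-probability of token_id, renormalized over the candidates kept by mask.

    Hypothetical helper: assumes logprob = logit - logZ, where logZ is the
    log-sum-exp over the kept candidates only.
    """
    kept = logits[mask]
    peak = kept.max()
    log_z = peak + np.log(np.exp(kept - peak).sum())
    return logits[token_id] - log_z


# Request A sampled token 0 from its own candidate set {0, 1}.
logits_a = np.array([5.0, 4.0, -3.0, -4.0])
mask_a = np.array([True, True, False, False])
# Request B's candidate set {2, 3}, wrongly paired with request A.
mask_b = np.array([False, False, True, True])

aligned = masked_logprob(logits_a, mask_a, token_id=0)
misaligned = masked_logprob(logits_a, mask_b, token_id=0)

print(f"aligned:    {aligned:.4f}")     # a valid log-probability (<= 0)
print(f"misaligned: {misaligned:.4f}")  # positive, which no log-probability can be
```

With the wrong mask, the sampled token is not even in the normalization set, so the result is no longer bounded above by zero. Values like this feeding a KL estimate would be one plausible route to the blow-up described above.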

Modifications

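  • fastdeploy/worker/input_batch.py: at the end of recover_batch_index_for_sampler_output, add reordering logic for sampling_mask (List[np.ndarray]) that applies the same src_order permutation used for the other fields, so that sampling_mask[i] and logprobs[i] always belong to the same request.
  • tests/worker/test_recover_batch_index_sampling_mask.py: add 5 unit tests covering a None value, normal reordering, an identity-mapping no-op, pd_reorder disabled, and tail elements staying in place.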
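The reordering described above can be sketched as a standalone helper. The function name `reorder_sampling_mask` and its signature are hypothetical; in the PR the logic lives inline in recover_batch_index_for_sampler_output, and only the `src_order` indexing rule is taken from the source.

```python
from typing import List, Optional, Sequence

import numpy as np


def reorder_sampling_mask(
    sampling_mask: Optional[List[np.ndarray]],
    src_order: Sequence[int],
) -> Optional[List[np.ndarray]]:
    """Apply the src_order permutation to the first len(src_order) entries.

    Entries beyond len(src_order) (the "tail") keep their positions.
    Hypothetical helper, not the FastDeploy API.
    """
    if sampling_mask is None:
        return None
    sort_len = len(src_order)
    reordered = [sampling_mask[src] for src in src_order]
    reordered.extend(sampling_mask[sort_len:])
    return reordered


# Each mask carries its original index so the permutation is easy to read.
masks = [np.array([i]) for i in range(4)]
src_order = [2, 0, 1]

restored = reorder_sampling_mask(masks, src_order)
restored_ids = [int(m[0]) for m in restored]
print(restored_ids)  # [2, 0, 1, 3]
```

Building a new list rather than permuting in place avoids overwriting an entry before it has been read, which matters whenever `src_order` is not the identity.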

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 11, 2026 08:07
@paddle-bot

paddle-bot Bot commented May 11, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

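This PR fixes recover_batch_index_for_sampler_output not restoring the batch order of sampling_mask when enable_keep_sampling_mask (KSM) and pd_reorder are both enabled. The fix keeps sampling_mask and logprobs aligned; without it, logprob normalization used the wrong candidate set, which led to downstream training anomalies such as KL blow-up.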

Changes:

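  • Adds the missing sampling_mask reordering to recover_batch_index_for_sampler_output so that it stays aligned, per request, with fields such as sampled_token_ids and logprobs_tensors.
  • Adds a unit test file covering sampling_mask=None, normal reordering, the identity mapping, pd_reorder disabled, and tail elements staying in place.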

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| fastdeploy/worker/input_batch.py | Adds sampling_mask reordering to recover_batch_index_for_sampler_output to fix the misalignment under KSM + pd_reorder. |
| tests/worker/test_recover_batch_index_sampling_mask.py | Adds unit test coverage for the sampling_mask reordering behavior. |

Comment on lines +1214 to +1220
sort_len = len(src_order)
real_sampling_mask = [None] * len(sampling_mask)
for i in range(sort_len):
    real_sampling_mask[i] = sampling_mask[src_order[i]]
for i in range(sort_len, len(sampling_mask)):
    real_sampling_mask[i] = sampling_mask[i]
sampler_output.sampling_mask = real_sampling_mask
Comment on lines +10 to +17
def _make_sampler_output(batch_size, with_sampling_mask=True):
    """Create a minimal mock SamplerOutput for testing reorder logic."""
    so = Mock()
    so.sampled_token_ids = paddle.arange(batch_size, dtype="int64").unsqueeze(1)
    so.logprobs_tensors = Mock()
    so.logprobs_tensors.logprob_token_ids = paddle.arange(batch_size, dtype="int64").unsqueeze(1)
    so.logprobs_tensors.logprobs = paddle.arange(batch_size, dtype="float32").unsqueeze(1)
    so.logprobs_tensors.selected_token_ranks = paddle.zeros([batch_size, 1], dtype="int64")
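The test excerpt stops before any test case is shown. As a rough illustration of the five scenarios the PR description lists, here is a dependency-light sketch that needs neither paddle nor Mock. `recover_sampling_mask`, `make_output`, and the `enable_pd_reorder` flag are hypothetical stand-ins; the real function's signature and the real test bodies are not visible in this thread.

```python
from types import SimpleNamespace

import numpy as np


def recover_sampling_mask(sampler_output, src_order, enable_pd_reorder=True):
    """Hypothetical stand-in for the sampling_mask part of the real function."""
    mask = sampler_output.sampling_mask
    if not enable_pd_reorder or mask is None:
        return
    n = len(src_order)
    sampler_output.sampling_mask = [mask[s] for s in src_order] + list(mask[n:])


def make_output(batch_size, with_sampling_mask=True):
    """Each mask stores its original batch index so order is easy to assert."""
    mask = [np.array([i]) for i in range(batch_size)] if with_sampling_mask else None
    return SimpleNamespace(sampling_mask=mask)


def ids(out):
    return [int(m[0]) for m in out.sampling_mask]


def test_none_mask_is_left_alone():
    out = make_output(3, with_sampling_mask=False)
    recover_sampling_mask(out, [2, 0, 1])
    assert out.sampling_mask is None


def test_normal_reorder():
    out = make_output(3)
    recover_sampling_mask(out, [2, 0, 1])
    assert ids(out) == [2, 0, 1]


def test_identity_mapping_is_noop():
    out = make_output(3)
    recover_sampling_mask(out, [0, 1, 2])
    assert ids(out) == [0, 1, 2]


def test_pd_reorder_disabled():
    out = make_output(3)
    recover_sampling_mask(out, [2, 0, 1], enable_pd_reorder=False)
    assert ids(out) == [0, 1, 2]


def test_tail_elements_stay_in_place():
    out = make_output(5)
    recover_sampling_mask(out, [1, 0, 2])
    assert ids(out) == [1, 0, 2, 3, 4]


ALL_TESTS = [
    test_none_mask_is_left_alone,
    test_normal_reorder,
    test_identity_mapping_is_noop,
    test_pd_reorder_disabled,
    test_tail_elements_stay_in_place,
]
for test in ALL_TESTS:
    test()
print(f"{len(ALL_TESTS)} scenarios passed")
```

Tagging each mask with its original index keeps the assertions readable. A fuller test would also check that the reordered masks line up with the reordered logprobs.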
PaddlePaddle-bot

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-11 16:35:43

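📋 Review Summary

PR overview: Fixes a bug where recover_batch_index_for_sampler_output skipped reordering the sampling_mask field when KSM (keep_sampling_mask) and pd_reorder are enabled, which misaligned the logprob computation and caused KL blow-up.
Scope of changes: fastdeploy/worker/input_batch.py, tests/worker/
Impact tags: [BugFix] [PD Disaggregation]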

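📝 PR Convention Check

The following issues were found:

  1. Unofficial tag: [KSM] in the title is not in the official tag list (see checklist §D1).
  2. Missing Cherry-Pick format: the target branch is release/2.6 (not develop), so the title should follow the format [Cherry-Pick][BugFix] ... (#original PR number).
  3. Last checklist item unchecked: the cherry-pick confirmation item for submissions to a release branch is not checked.

Suggested title (can be copied directly):

  • [Cherry-Pick][BugFix] Fix sampling_mask reordering in recover_batch_index_for_sampler_output (#original PR number)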
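The title rules the bot applies above could be approximated with a small checker like the one below. This is a sketch under assumptions: `check_title` is a hypothetical function, the tag set is copied from the checklist's tag list, and the actual rules used by Paddle-CI-Agent or check_approval.sh are not shown in this thread. The checklist also permits new tags with clear semantics, so a real checker may treat an unofficial tag as a warning rather than an error.

```python
import re

# Tag names copied from the checklist's tag list. "Cherry-Pick" is handled
# separately because the checklist names it as the release-branch marker.
OFFICIAL_TAGS = {
    "FDConfig", "APIServer", "Engine", "Scheduler", "PD Disaggregation",
    "Executor", "Graph Optimization", "Speculative Decoding", "RL", "Models",
    "Quantization", "Loader", "OP", "KVCache", "DataProcessor", "BugFix",
    "Docs", "CI", "Optimization", "Feature", "Benchmark", "Others", "XPU",
    "HPU", "GCU", "DCU", "Iluvatar", "Metax",
}


def check_title(title, target_branch):
    """Return a list of convention problems (hypothetical rules, see lead-in)."""
    problems = []
    tags = re.findall(r"\[([^\[\]]+)\]", title)
    for tag in tags:
        if tag != "Cherry-Pick" and tag not in OFFICIAL_TAGS:
            problems.append(f"unofficial tag [{tag}]")
    if target_branch != "develop":
        if "Cherry-Pick" not in tags:
            problems.append("missing [Cherry-Pick] tag")
        if not re.search(r"\(#\d+\)\s*$", title):
            problems.append("missing trailing (#<original PR number>)")
    return problems


# The title as it appears (truncated) in this PR.
merged_title = "[BugFix][KSM] Fix sampling_mask reordering in recover_batch_index_for…"
problems = check_title(merged_title, "release/2.6")
for problem in problems:
    print(problem)
```

Run against the title of this PR, the sketch reports the same three kinds of issue the bot lists: the unofficial tag, the missing [Cherry-Pick] marker, and the missing reference to the original PR.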

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):

## Motivation
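When `enable_keep_sampling_mask` (KSM) and `pd_reorder` are both enabled, `recover_batch_index_for_sampler_output` restores the batch order of fields such as `logprobs_tensors` and `sampled_token_ids` but omits `sampling_mask`. As a result, each candidate vocabulary is paired with the wrong logprobs, the logZ normalization uses another request's candidate set, and abnormal logprob values are produced, which ultimately causes KL blow-up in downstream RL training (peaking at 64048).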

## Modifications
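- `fastdeploy/worker/input_batch.py`: at the end of `recover_batch_index_for_sampler_output`, add reordering logic for `sampling_mask` (`List[np.ndarray]`) that applies the same `src_order` permutation used for the other fields, so that `sampling_mask[i]` and `logprobs[i]` always belong to the same request.
- `tests/worker/test_recover_batch_index_sampling_mask.py`: add 5 unit tests covering a None value, normal reordering, an identity-mapping no-op, `pd_reorder` disabled, and tail elements staying in place.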

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

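Issues

| Level | File | Summary |
| --- | --- | --- |
| 📝 PR convention | Title/Checklist | Unofficial tag [KSM]; target branch release/2.6 lacks the [Cherry-Pick] format; the cherry-pick source confirmation item in the checklist is unchecked |

Overall Assessment

The code logic is correct and the unit tests are thorough (5 scenarios). The change effectively fixes the missing sampling_mask reordering in the PD disaggregation + KSM scenario. Only the PR title needs adjusting, and the cherry-pick source process needs to be confirmed.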

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 16:51:50

The CI report is generated from the following code (refreshed every 30 minutes):


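1 Task Overview

There is currently 1 failed required task (Approval: missing FastDeploy RD approval), plus 2 required tasks running and 3 required tasks waiting. Please follow up.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 36 (0) | 36 | 27 | 2 | 3 | 4 | 0 |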

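2 Task Status Summary

2.1 Required tasks: 4/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 8s | Process issue: the PR lacks FastDeploy RD approval (qingqing01/jiangjiajun/heavengate) | Ask one of qingqing01/jiangjiajun/heavengate to approve this PR | Job | - |
| ⏳ | run_ce_cases | - | Running | - | Job | - |
| ⏳ | run_tests_with_coverage | - | Running | - | Job | - |
| ⏸️ | run_4_cards_tests | - | Waiting | - | - | - |
| ⏸️ | run_xpu_4cards_cases | - | Waiting | - | - | - |
| ⏸️ | run_xpu_8cards_cases | - | Waiting | - | - | - |
| ✅ | The other 4 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Trigger Jenkins for PR | 51s | Job | - |
| ⏳ | run_iluvatar_cases | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 23 optional tasks passed | - | - | - |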

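3 Failure Details (required only)

Approval: process issue (approval) (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: process issue (approval)
  • Confidence: high
  • Root cause summary: the PR lacks the required FastDeploy RD approval (one of qingqing01/jiangjiajun/heavengate is needed)
  • Analyzer: generic analysis (fallback)

Root cause details:
The check_approval.sh script found 1 approval error and exited with code 6. The script requires that PRs targeting release/2.6 be approved by at least one FastDeploy core RD (qingqing01/dangqingqing, Jiang-Jia-Jun/jiangjiajun, or heavengate/dengkaipeng). This PR has not yet received that approval.

Key log: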

==> PR title: [BugFix][KSM] Fix sampling_mask reordering in recover_batch_index_for…
0. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick]
   Approval required from FastDeploy RD: qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng).
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

  1. Ask qingqing01 (dangqingqing), Jiang-Jia-Jun (jiangjiajun), or heavengate (dengkaipeng) to approve this PR.

Fix summary: ask any one of qingqing01/jiangjiajun/heavengate to approve this PR.

Link: view log

@yuanlehome yuanlehome merged commit f8a0cf2 into PaddlePaddle:release/2.6 May 11, 2026
30 of 37 checks passed