[Cherry-Pick][BugFix] redefine tmp_workspace using full tensor in append_attn (#6999) #7002
lizhenyun01 wants to merge 9 commits into PaddlePaddle:feature/rl/cpu-cache-20250324
Conversation
… in API server (PaddlePaddle#6551) (PaddlePaddle#6554)
* Initial plan
* [BugFix][APIServer] Add control_socket_disable to gunicorn options (cherry-pick of PaddlePaddle#6551)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…refix_tree_status_signal not initialized (PaddlePaddle#6531) (PaddlePaddle#6559)
* fix mtp acceptance rate decline
* [BugFix] Fix AttributeError in recycle_gpu_blocks when prefix_tree_status_signal not initialized
  - Add hasattr check before accessing prefix_tree_status_signal
  - The signal is only initialized in launch_cache_messager, not in __init__
  - Fixes CI test failure in test_prefix_cache_manager.py
* [BugFix] Reset prefix cache when model weights are updating
  - Call self.reset() before setting status to NORMAL in UPDATING state
  - Ensure cache consistency when model weights change
  - Consistent with CLEARING state handling
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…Paddle#6597)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>

…#6655) When `is_dummy_run=True`, calling `empty_input_forward` can cause unexpected behavior. Add `and not is_dummy_run` guard for both `_propose_cuda` and `_propose_xpu` paths.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks for your contribution!
fastdeploy-bot left a comment (AI CI Agent | skill: pr_review_agent)
Review summary

This PR changes the temporary buffers of the append_attn op (tmp_workspace, tmp_m, tmp_d) from per-layer dynamic allocation to buffers that are preallocated and shared across layers, which is a reasonable memory optimization.

Issues found

P1 - Wrong argument order in test code: in two calls to append_attention in test_append_attention.py, the tmp_workspace, tmp_m, and tmp_d arguments are placed before cache_k and cache_v, which does not match the order defined by the function signature and will cause arguments to be passed incorrectly when the tests run.

Other changes

* rollout_model.py: fixed a variable name error (self.fd_config → self.mtp_fd_config) ✅
* prefix_cache_manager.py: added a defensive check to prevent block recycling in abnormal states ✅
* mtp.py: added a `not is_dummy_run` guard ✅
* api_server.py: added the control_socket_disable option to avoid conflicts between multiple instances ✅

Please fix the argument order in the test code before merging.
```python
qkv_copy = copy.deepcopy(qkv)
append_attention(
    qkv_copy,
    self.tmp_workspace,
```
P1 - Wrong argument order: per the append_attention function signature, the argument order should be qkv, key_cache, value_cache, tmp_workspace, tmp_m, tmp_d, .... Here tmp_workspace, tmp_m, and tmp_d should come after self.cache_k_T and self.cache_v_T, not before. The current order passes the arguments incorrectly.
```python
start_time = time.time()
out = append_attention(
    qkv,
    self.tmp_workspace,
```
P1 - Wrong argument order: same as above, tmp_workspace, tmp_m, and tmp_d should come after self.cache_k and self.cache_v. The correct order is: qkv, self.cache_k, self.cache_v, self.tmp_workspace, self.tmp_m, self.tmp_d, ...
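To make the failure mode concrete, here is a minimal, self-contained sketch. The real `append_attention` is a custom CUDA op with many more parameters; this stand-in keeps only the six names discussed in the review and simply reports which tensor each positional slot received, so a mis-ordered call is easy to spot.

```python
# Hypothetical stand-in for the real append_attention op, used only to
# illustrate the positional-argument bug flagged in the review. The real
# signature has many more parameters.
def append_attention(qkv, key_cache, value_cache, tmp_workspace, tmp_m, tmp_d):
    # Return which value each positional slot was bound to.
    return {
        "qkv": qkv,
        "key_cache": key_cache,
        "value_cache": value_cache,
        "tmp_workspace": tmp_workspace,
        "tmp_m": tmp_m,
        "tmp_d": tmp_d,
    }

# Correct call: the KV caches come first, then the shared scratch buffers.
ok = append_attention("qkv", "cache_k", "cache_v", "ws", "m", "d")

# The buggy test call placed the scratch buffers before the caches, so
# key_cache silently receives tmp_workspace:
bad = append_attention("qkv", "ws", "m", "d", "cache_k", "cache_v")

assert ok["key_cache"] == "cache_k"
assert bad["key_cache"] == "ws"  # wrong tensor bound to key_cache
```

Because the op takes these tensors positionally, no error is raised at the call site; the wrong tensors are simply read and written, which is why the review marks this P1.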
Motivation
Change the tmp_workspace, tmp_m, and tmp_d buffers used by the append_attn op in the split-KV path so that they are passed in by the backend and shared across layers, instead of being allocated per layer.
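The allocation change can be sketched as follows. This is not the actual FastDeploy code: the class and parameter names (`AttentionBackend`, `max_workspace_elems`, `max_m_d_elems`) are illustrative, and plain `bytearray`s stand in for GPU tensors. The point is only that one full-size set of scratch buffers is allocated up front and every layer reuses it, so peak scratch memory is a single set rather than one per layer.

```python
# Illustrative sketch of backend-owned, layer-shared scratch buffers.
# Names and sizes are hypothetical; real buffers are GPU tensors.
class AttentionBackend:
    def __init__(self, num_layers, max_workspace_elems, max_m_d_elems):
        # Allocated once, sized for the worst case across all layers.
        self.tmp_workspace = bytearray(max_workspace_elems)
        self.tmp_m = bytearray(max_m_d_elems)
        self.tmp_d = bytearray(max_m_d_elems)
        self.num_layers = num_layers

    def forward(self):
        buffers_seen = set()
        for _ in range(self.num_layers):
            # Every layer is handed the same preallocated buffers instead
            # of allocating fresh ones inside the op.
            buffers_seen.add(id(self.tmp_workspace))
        return buffers_seen

backend = AttentionBackend(num_layers=4, max_workspace_elems=1024, max_m_d_elems=64)
assert len(backend.forward()) == 1  # all layers share one workspace buffer
```

Sharing is safe here because layers execute sequentially and the split-KV scratch data does not need to survive past a layer's attention call.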
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
For a PR targeting a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.