Fix tprefetch_async and syncall #728

Open
FangRui0 wants to merge 3 commits into hw-native-sys:main from FangRui0:fix_verify

@FangRui0
Contributor

No description provided.

FangRui0 added 2 commits May 29, 2026 10:00
Signed-off-by: FangRui <fangrui_95@163.com>
Signed-off-by: FangRui <fangrui_95@163.com>
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@FangRui0
Contributor Author

/run a3

@reedhecre

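Received /run a3; the A3 board tester will handle this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.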

@reedhecre

reedhecre commented May 29, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: Fix tprefetch_async and syncall #728
  • Author: FangRui0
  • Base/Head: main / fix_verify
  • Head SHA: 4ea4127f810b
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-05-29T07:15:23Z
  • Previous Head SHA: 9104ea9a423d
  • Status: completed

Summary

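Found 3 issues: the tprefetch_async sample does not actually verify prefetch semantics; the syncall sample drops two previously covered variants; and the level3 fallback for syncall_binding in runop.sh stops working when a --pto-level override is already present.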

Findings

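  1. P2 `tprefetch_async_binding` still passes when `TPREFETCH_ASYNC` degrades to a no-op (test/samples/TPrefetchAsync/tprefetch_async_binding.py:55)

After waiting on the async event, this sample performs an explicit copy with a plain TLoadOp(src) + TStoreOp(dst), and compare only checks dst == src. As a result, as long as the event returns, the test still passes even if TPREFETCH_ASYNC itself is wrongly lowered to a no-op or prefetches the wrong address. For a PR that only changes test/validation assets, this means the target regression for tprefetch_async is not actually guarded.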
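The gap described in this finding can be illustrated with a toy model. This is only a sketch under stated assumptions: `run_sample`, the three prefetch stand-ins, and the `cache` dictionary are hypothetical names invented for illustration, not PTO APIs, and the model assumes the prefetch has some observable effect that a stricter check could inspect.

```python
# Toy model only. `cache` is a hypothetical stand-in for whatever state a real
# prefetch is expected to change; none of these names are PTO APIs.

def run_sample(src, prefetch):
    """Mirror the sample's data flow: prefetch, then an explicit copy."""
    cache = {}
    prefetch(src, cache)   # stand-in for TPREFETCH_ASYNC plus the event wait
    dst = list(src)        # stand-in for the explicit TLoadOp(src) + TStoreOp(dst)
    return dst, cache

def working_prefetch(src, cache):
    cache["data"] = list(src)

def noop_prefetch(src, cache):
    pass  # regression: the prefetch silently does nothing

def wrong_address_prefetch(src, cache):
    cache["data"] = [0.0] * len(src)  # regression: the wrong data was fetched

def weak_check(src, dst, cache):
    # What the sample compares today: only the explicit copy is observed.
    return dst == src

def strict_check(src, dst, cache):
    # Also requires the (assumed) observable effect of the prefetch itself.
    return dst == src and cache.get("data") == src

def evaluate():
    src = [float(i + 1) for i in range(8)]
    variants = {
        "working": working_prefetch,
        "noop": noop_prefetch,
        "wrong_address": wrong_address_prefetch,
    }
    results = {}
    for name, impl in variants.items():
        dst, cache = run_sample(src, impl)
        results[name] = (weak_check(src, dst, cache), strict_check(src, dst, cache))
    return results

results = evaluate()
```

In this model the weak check passes for all three variants, while the strict check passes only for the working one. How the real sample could observe the prefetch's effect on hardware is not stated in the review, so the strict check here is an assumption rather than a proposed fix.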

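  2. P2 `syncall_binding` drops the previously covered `mix` / `hard` variants (test/samples/SyncAll/syncall_binding.py:60)

Before this change, the sample compiled three forms of pto.syncall: soft + aiv_only, soft + mix with l1_workspace, and zero-operand hard + mix. Only the first remains. runop.sh is the only automated coverage entry point for this sample, so the two previously covered lowering paths no longer run in CI at all; if they regress, this PR lets the problem slip through.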

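  3. P3 The level3 special case for `syncall_binding` in `runop.sh` does not hold when a `--pto-level` override is already present (test/samples/runop.sh:474)

The command is reassembled here as --pto-level=level3 ${ptoas_flags[@]}. If the caller has already passed --pto-level=level1/level2 through PTOAS_FLAGS, the final command carries two conflicting --pto-level flags and cannot reliably guarantee that this sample is compiled at level3. This breaks the PTOAS_FLAGS override contract that the script exposes, and syncall_binding can still fail outside the default CI configuration.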
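One way to avoid the conflicting flags is to drop any existing `--pto-level=` entry before adding the forced one. The sketch below shows that logic in Python purely for illustration; the real script is Bash, `force_pto_level` is a hypothetical helper, and `--some-other-flag` is a made-up placeholder. It handles only the `--pto-level=<value>` spelling quoted in the finding and makes no assumption about how the tool resolves duplicate flags.

```python
def force_pto_level(flags, level="level3"):
    """Return a new flag list containing exactly one --pto-level=<level>.

    Any caller-supplied --pto-level=... entry is removed first, so the forced
    level cannot conflict with an earlier override. Other flags keep their order.
    """
    kept = [flag for flag in flags if not flag.startswith("--pto-level=")]
    return [f"--pto-level={level}"] + kept


# Hypothetical caller-supplied flags; "--some-other-flag" is a placeholder.
caller_flags = ["--pto-level=level1", "--some-other-flag"]
forced = force_pto_level(caller_flags)
```

With this approach the special case always compiles at level3, which means it deliberately overrides the caller for this one sample. Whether that is the desired contract for PTOAS_FLAGS is a design decision the finding leaves open.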

@reedhecre

A3 board test failed

  • Trigger mode: manual
  • Source commit: f7473a4662a6
  • Result summary: OK 218 / FAIL 1 / SKIP 1
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260529_124005_manual_pr728.log
  • Manual command: /run a3
  • Triggered by: FangRui0
  • Trigger comment: Fix tprefetch_async and syncall #728 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • tprefetch_async_binding (run, exit=1)

@reedhecre

A3 board test failure details: PR #728

tprefetch_async_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260529_124005_manual_pr728/npu_validation/TPrefetchAsync/tprefetch_async_binding/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3563598] 2026-05-29-13:06:04.366.084 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 435, there is an exception of aivec error, core id is 10, error code = 0, dump info: pc start: 0x124800000000, current: 0x124800000164, vec error info: 0x1e000000a8, mte error info: 0x2060000b0, ifu error info: 0x200007cf20d40, ccu error info: 0x52, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x60000b0, fixp_error1 info: 0x2, fsmId:0, tslot:3, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z30tprefetch_async_binding_kernelPfS_Pa, fault kernel info ext=_Z30tprefetch_async_binding_kernelPfS_Pa, program id=0, hash=3486264402363174376.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[ERROR]  ./v3.bin: file size (16384) is larger than buffer size (128)
[2026-05-29 13:06:05] ERROR: testcase failed (exit 1): tprefetch_async_binding

Signed-off-by: FangRui <fangrui_95@163.com>
@FangRui0
Contributor Author

/run a3 tprefetch_async_binding

@reedhecre

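Received /run a3 tprefetch_async_binding; the A3 board tester will handle this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.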

@reedhecre

A3 board test passed

  • Trigger mode: manual
  • Source commit: 7d32ef4a8e26
  • Result summary: OK 1 / FAIL 0 / SKIP 0
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260529_154806_manual_pr728.log
  • Result TSV: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260529_154806_manual_pr728.tsv
  • Manual command: /run a3 tprefetch_async_binding
  • Triggered by: FangRui0
  • Selected cases: tprefetch_async_binding
  • Trigger comment: Fix tprefetch_async and syncall #728 (comment)

@FangRui0
Contributor Author

/run a5 syncall_binding tprefetch_async_binding

@reedhecre

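Received /run a5 syncall_binding tprefetch_async_binding; the A5 board tester will handle this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.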

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: eaa5ed8b40b4
  • Result summary: OK 1 / FAIL 1 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260530_164106_manual_pr728.log
  • Manual command: /run a5 syncall_binding tprefetch_async_binding
  • Triggered by: FangRui0
  • Selected cases: syncall_binding, tprefetch_async_binding
  • Trigger comment: Fix tprefetch_async and syncall #728 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • tprefetch_async_binding (run, exit=1)

@reedhecre

A5 board test failure details: PR #728

tprefetch_async_binding

stage=run info=exit=1

[SDMA] aclrtSynchronizeStream (aicpu) failed
[ERROR] SdmaWorkspaceManager::Init failed
[2026-05-30 16:44:44] ERROR: testcase failed (exit 1): tprefetch_async_binding
