Summary
insert-sync guards a same-address GM store→load round-trip inside one vector kernel with only an MTE3→MTE2 pipe flag. The pipe flag orders the pipes but does not guarantee that the GM write is globally visible/committed before the MTE2 read of the same address. The reader (TLOAD) intermittently observes stale or in-flight data. Occasionally the concurrent read/write of the same GM line hangs the MTE engine: the kernel sits in RUNNING forever → AICPU 800k-idle TIMEOUT_EXIT → host 507046.
The defining evidence that this is a sync/visibility bug and not codegen: re-running the exact same compiled binary (same .o/.so) gives three different outcomes — PASS, wrong-values, and on-core hang — i.e. it is timing/visibility dependent, not deterministic.
This is the same family as #696 (missing writer-side dcci+dsb after GM stores) and #706 / #711 (missing MTE3→MTE2 hazards around tput/tnotify).
Where it happens (pto level)
Kernel rmsnorm_rope, pto.kernel_kind = vector, target a2a3, launched as spmd (block_num=4). The .pto (insert-sync input) has a write-then-read on the same tensor normed_kv across an SSA-view boundary:
// inside scf.for %k0 (last iter writes cols 448..512):
%normed_kv...__iter_v1_pview = pto.partition_view %normed_kv...__ssa_v0_view, offsets = [%20, %k0], sizes = [16, 64]
pto.tstore ins(%normed_chunk...) outs(%normed_kv...__iter_v1_pview) // GM store, normed_kv[bb, k0:k0+64]
}
// immediately after the loop:
%normed_kv...__rv_v2_pview = pto.partition_view %normed_kv...__ssa_v0_view, offsets = [%20, %c448], sizes = [16, 64]
pto.tload ins(%normed_kv...__rv_v2_pview) outs(%kv_rope_slice...) // GM load, normed_kv[bb, 448:512] <-- reads exactly what the last store wrote
The store target (__iter_v1) and load source (__rv_v2) are different SSA views of the same underlying normed_kv tensor and the same GM address [bb, 448:512].
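The aliasing between the two views can be checked mechanically. The following is a minimal sketch in plain host-side C++, not part of ptoas: `View2D`, `views_overlap`, and `first_element` are hypothetical names introduced here for illustration. It assumes a row-major tensor whose row stride is 512 elements (taken from `normed_kv + bb*512 + 448` in the generated cpp) and views that do not wrap past the end of a row.

```cpp
#include <cstdint>

// Hypothetical description of a 2-D window into a row-major GM tensor,
// mirroring the offsets/sizes operands of pto.partition_view.
struct View2D {
    std::int64_t row, col;    // offsets, in elements
    std::int64_t rows, cols;  // sizes, in elements
};

// Do the half-open intervals [a, a+n) and [b, b+m) intersect?
constexpr bool intervals_overlap(std::int64_t a, std::int64_t n,
                                 std::int64_t b, std::int64_t m) {
    return a < b + m && b < a + n;
}

// True when both views touch at least one common element of the same tensor.
// Assumes neither view wraps past the end of a row.
constexpr bool views_overlap(const View2D& x, const View2D& y) {
    return intervals_overlap(x.row, x.rows, y.row, y.rows) &&
           intervals_overlap(x.col, x.cols, y.col, y.cols);
}

// Linear element offset of the view's first element for a given row stride.
constexpr std::int64_t first_element(const View2D& v, std::int64_t row_stride) {
    return v.row * row_stride + v.col;
}
```

With the last loop iteration (`%k0 = 448`) as the store view and the `%c448` view as the load, both views have the same offsets and sizes, so `views_overlap` returns true and `first_element` yields the same offset for both. Any earlier iteration (for example `%k0 = 384`) does not overlap the load.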
What insert-sync emitted (generated cpp)
// loop body (last iteration), store:
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
TSTORE(v120, v116); // normed_kv[bb, k0:k0+64]
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); // loop-carried; last set has no in-loop consumer
}
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
... v125 = GlobalTensor(normed_kv + bb*512 + 448 ...)
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1); // <-- intended store->load barrier
TLOAD(v121, v125); // normed_kv[bb, 448:512]
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID3);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID3);
TGATHER<...P0101>(v126, v121);
So a single MTE3→MTE2 EVENT_ID1 flag is the only thing guarding the round-trip. On a2a3 this pipe flag is insufficient for a GM store→load of the same address: it fires on MTE3 instruction retire, not on GM write commit/visibility. A writer-side cache maintenance / memory barrier (e.g. dcci+dsb, cf. #696) — or otherwise a guarantee that the store is GM-visible before the dependent MTE2 load issues — appears to be missing.
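The retire-versus-commit distinction above can be illustrated with a toy timeline. This is a sketch of the hypothesis only, written in plain C++; it is not a model of the a2a3 hardware, and `StoreTimeline`, `Sync`, `load_issue_tick`, and `observed_value` are hypothetical names. It assumes the store has a retire tick (when the pipe flag is set) and a later-or-equal commit tick (when the data is visible in GM).

```cpp
#include <cstdint>

// Toy timeline of one GM store, in arbitrary ticks. Not a hardware model.
struct StoreTimeline {
    std::uint64_t retire_tick;  // MTE3 instruction retires; the pipe flag is set here
    std::uint64_t commit_tick;  // data becomes visible in GM (>= retire_tick)
};

// FlagOnRetire: the load waits only for the pipe flag.
// WaitForCommit: the load waits until the write is GM-visible.
enum class Sync { FlagOnRetire, WaitForCommit };

// Earliest tick at which the dependent load may issue under each scheme.
constexpr std::uint64_t load_issue_tick(const StoreTimeline& s, Sync sync) {
    return sync == Sync::FlagOnRetire ? s.retire_tick : s.commit_tick;
}

// The load sees the new value only if it issues at or after the commit.
constexpr std::uint32_t observed_value(const StoreTimeline& s, Sync sync,
                                       std::uint32_t old_value,
                                       std::uint32_t new_value) {
    return load_issue_tick(s, sync) >= s.commit_tick ? new_value : old_value;
}
```

In this model, a store with `retire_tick == commit_tick` reads back correctly under `FlagOnRetire`, while the same code with a later `commit_tick` returns the old value. That matches the observation that one binary can both pass and produce wrong values depending on timing. The hang outcome is outside what this sketch covers.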
(Side observation, possibly unrelated: the loop-carried set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0) at the bottom of the store loop has no matching wait consuming the final iteration's set inside the loop; it is only drained in the epilogue.)
Runtime evidence (a2a3 silicon)
Same binary, three runs:
- PASS
- wrong-values: kv/cmp_kv_cache ~3.96% mismatches, all where golden expects 0, i.e. the round-trip TLOAD returned stale/garbage data that then propagated through the gather/RoPE/scatter.
- hang: scheduler stall diag pins the hung task to this exact kernel:
TASK ring=1 ... state=RUNNING fanin_refcount=3/3 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[cores=[core=26(aiv0) core=32(aiv0)]]
SUMMARY completed=4/7 ... scan_running=1
... handle_timeout_exit TIMEOUT_EXIT after_idle_iterations=800000
aiv0:3 is rmsnorm_rope; its fan-in is fully satisfied (3/3) and it is dispatched/RUNNING but never reports completion → host RuntimeError: run_prepared failed with code 507046.
Confirmation that the round-trip is the cause
Rewriting the source so that the dependent slice stays in local memory (UB), recomputing the RoPE input tile instead of reading normed_kv back from GM, removes the GM store→load round-trip and makes the kernel pass 5/5 fresh-compile runs (stall and wrong-values both gone). Nothing else was changed.
Environment
| Component | Version |
| --- | --- |
| ptoas | v0.43 |
| pypto | a861ec71 |
| pto-isa | 6d785d03 |
| target | a2a3, kernel_kind=vector, spmd block_num=4 |
Attachments
rmsnorm_rope.pto (insert-sync input) and the generated rmsnorm_rope.cpp will be attached in a comment.