Component
EmitC / Codegen (lib/PTO/Transforms/PTOToEmitC.cpp)
Description
When PTOAS lowers pto.comm.tnotify, it emits the pto::comm::TNOTIFY(...) call without first draining MTE-side pipes. Any prior op that uses MTE (any pto.tload or pto.tstore, regardless of whether the source/destination is local or peer-addressed) may still be in flight at the moment the signal write is issued. The inner signal store inside TNOTIFY_IMPL runs on the scalar pipe and is not ordered against MTE — and TNOTIFY_IMPL's own trailing pipe_barrier(PIPE_ALL) happens after the signal write, so it does not save us either.
This breaks the contract that a notify/wait handshake implies "everything I issued before the notify is visible after the matching wait". All four variants are affected:
pto.tstore to a peer-addressed partition_tensor_view (most reliably broken — remote SDMA latency is the largest)
pto.tstore to a local partition_tensor_view (data may still be in MTE3 when signal lands; the local consumer reads stale bytes)
pto.tload from a peer-addressed view (in-flight read may not be complete when caller signals downstream that "input is consumed")
pto.tload from a local view (same hazard if the load result feeds a signal-driven downstream)
The fix belongs in the lowering of pto.comm.tnotify: drain MTE-side pipes before the call. Simplest correct form is pipe_barrier(PIPE_ALL); immediately before the generated pto::comm::TNOTIFY(...). Equivalent forms: set_flag(PIPE_MTE3, PIPE_S, EID); wait_flag(PIPE_MTE3, PIPE_S, EID) (plus matching MTE2 if reads are in flight), or wiring the relevant event handles into the variadic WaitEvents parameter of TNOTIFY(...). The fix must live in the synthesizer so callers do not need to manually insert sync.
Reproduction (minimal)
Minimal PTO MLIR exhibiting the bug — two pto.tstore ops to disjoint offsets of the same peer window, followed by a single pto.comm.tnotify (this is the smallest pattern that reliably fails in practice; the second SDMA write is the one most likely to be in flight when the signal is issued):
module attributes {pto.target_arch = "a2a3"} {
func.func @repro(%arg_low: !pto.ptr<f32>, // local source A (inp_low)
%arg_high: !pto.ptr<f32>, // local source B (inp_high)
%arg_dst: !pto.ptr<f32>, // local dst, window-bound
%arg_sig: !pto.ptr<i32>, // local signal, window-bound
%peer: i32, // peer rank
%ctx_f32: !pto.ptr<i64>, // comm ctx for f32
%ctx_i32: !pto.ptr<i64>) // comm ctx for i32
attributes {pto.kernel_kind = #pto.kernel_kind<vector>} {
%c0_i64 = arith.constant 0 : i64
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1_i32 = arith.constant 1 : i32
%peer_idx = arith.index_cast %peer : i32 to index
// ===== First half: tile_low = TLOAD(inp_low); TSTORE peer.dst[0:32] = tile_low =====
%tile_low = pto.alloc_tile addr = %c0_i64 valid_row = %c1 valid_col = %c32
: !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
v_row=?, v_col=?, blayout=row_major,
slayout=none_box, fractal=512, pad=0>
%low_view = pto.make_tensor_view %arg_low,
shape = [%c1, %c32], strides = [%c32, %c1]
{layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
%low_pv = pto.partition_view %low_view,
offsets = [%c0, %c0], sizes = [%c1, %c32]
: !pto.tensor_view<?x?xf32>
-> !pto.partition_tensor_view<1x32xf32>
pto.tload ins(%low_pv : !pto.partition_tensor_view<1x32xf32>)
outs(%tile_low : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
v_row=?, v_col=?, blayout=row_major,
slayout=none_box, fractal=512, pad=0>)
%off_a = func.call @CommRemoteOffset_f32(%ctx_f32, %peer_idx)
: (!pto.ptr<i64>, index) -> index
%dst_a = pto.addptr %arg_dst, %off_a : !pto.ptr<f32> -> !pto.ptr<f32>
%rs = arith.muli %c1, %c64 : index
%view_a = pto.make_tensor_view %dst_a,
shape = [%c1, %c64], strides = [%rs, %c1]
{layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
%peer_pv_low = pto.partition_view %view_a,
offsets = [%c0, %c0], sizes = [%c1, %c32]
: !pto.tensor_view<?x?xf32>
-> !pto.partition_tensor_view<1x32xf32>
pto.tstore ins(%tile_low : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
v_row=?, v_col=?, blayout=row_major,
slayout=none_box, fractal=512, pad=0>)
outs(%peer_pv_low : !pto.partition_tensor_view<1x32xf32>)
// ===== Second half: tile_high = TLOAD(inp_high); TSTORE peer.dst[32:64] = tile_high =====
%tile_high = pto.alloc_tile addr = %c0_i64 valid_row = %c1 valid_col = %c32
: !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
v_row=?, v_col=?, blayout=row_major,
slayout=none_box, fractal=512, pad=0>
%high_view = pto.make_tensor_view %arg_high,
shape = [%c1, %c32], strides = [%c32, %c1]
{layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
%high_pv = pto.partition_view %high_view,
offsets = [%c0, %c0], sizes = [%c1, %c32]
: !pto.tensor_view<?x?xf32>
-> !pto.partition_tensor_view<1x32xf32>
pto.tload ins(%high_pv : !pto.partition_tensor_view<1x32xf32>)
outs(%tile_high : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
v_row=?, v_col=?, blayout=row_major,
slayout=none_box, fractal=512, pad=0>)
%off_b = func.call @CommRemoteOffset_f32(%ctx_f32, %peer_idx)
: (!pto.ptr<i64>, index) -> index
%dst_b = pto.addptr %arg_dst, %off_b : !pto.ptr<f32> -> !pto.ptr<f32>
%rs2 = arith.muli %c1, %c64 : index
%view_b = pto.make_tensor_view %dst_b,
shape = [%c1, %c64], strides = [%rs2, %c1]
{layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
%peer_pv_high = pto.partition_view %view_b,
offsets = [%c0, %c32], sizes = [%c1, %c32]
: !pto.tensor_view<?x?xf32>
-> !pto.partition_tensor_view<1x32xf32>
pto.tstore ins(%tile_high : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
v_row=?, v_col=?, blayout=row_major,
slayout=none_box, fractal=512, pad=0>)
outs(%peer_pv_high : !pto.partition_tensor_view<1x32xf32>)
// ===== Notify peer that both halves are done =====
%sig_off = func.call @CommRemoteOffset_i32(%ctx_i32, %peer_idx)
: (!pto.ptr<i64>, index) -> index
%sig_ptr = pto.addptr %arg_sig, %sig_off : !pto.ptr<i32> -> !pto.ptr<i32>
%sig_view = pto.make_tensor_view %sig_ptr,
shape = [%c1, %c1], strides = [%c1, %c1]
{layout = #pto.layout<nd>} : !pto.tensor_view<?x?xi32>
%sig_pv = pto.partition_view %sig_view,
offsets = [%c0, %c0], sizes = [%c1, %c1]
: !pto.tensor_view<?x?xi32>
-> !pto.partition_tensor_view<1x1xi32>
pto.comm.tnotify(%sig_pv, %c1_i32 : !pto.partition_tensor_view<1x1xi32>, i32)
{notifyOp = #pto<notify_op set>}
return
}
}
Lower with ptoas for the a2a3 backend (any vector-kernel invocation that goes through PTOToEmitC) and inspect the emitted .cpp. Two TSTORE calls appear back-to-back, followed by pto::comm::TNOTIFY with no MTE drain — exactly the race window.
The same defect surfaces with the symmetric pattern pto.tload (local or peer) before pto.comm.tnotify. The remote-store-with-two-stores variant is just the easiest to observe failing.
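Inspecting the emitted .cpp by eye gets tedious across many kernels, so a small textual check can flag the pattern. The following is a minimal sketch, not part of PTOAS: the helper name findUndrainedNotifies is hypothetical, it only matches the call spellings quoted in this issue (TLOAD(, TSTORE(, TNOTIFY(, pipe_barrier(PIPE_ALL)), and it deliberately does not treat set_flag/wait_flag pairs as a drain.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (assumption, not a PTOAS API): returns the indices of
// lines that call TNOTIFY while an earlier TLOAD/TSTORE has not yet been
// followed by pipe_barrier(PIPE_ALL). Purely textual; trailing // comments
// are ignored so that commentary mentioning TSTORE( does not count.
std::vector<std::size_t> findUndrainedNotifies(const std::vector<std::string> &lines) {
    std::vector<std::size_t> hits;
    bool mtePending = false;  // true once an MTE op has been seen with no drain after it
    for (std::size_t i = 0; i < lines.size(); ++i) {
        // Keep only the code part of the line (substr(0, npos) keeps everything).
        const std::string code = lines[i].substr(0, lines[i].find("//"));
        if (code.find("pipe_barrier(PIPE_ALL)") != std::string::npos) {
            mtePending = false;
        } else if (code.find("TNOTIFY(") != std::string::npos) {
            if (mtePending) {
                hits.push_back(i);
            }
            // TNOTIFY_IMPL ends with pipe_barrier(PIPE_ALL), so later notifies
            // start from a drained state.
            mtePending = false;
        } else if (code.find("TLOAD(") != std::string::npos ||
                   code.find("TSTORE(") != std::string::npos) {
            mtePending = true;
        }
    }
    return hits;
}
```

Fed the lines of the emitted kernel body, a non-empty result points at each TNOTIFY that can overtake in-flight MTE traffic. Because the check is conservative, a lowering that drains through set_flag/wait_flag or WaitEvents would still be reported and needs a manual look.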
Expected behavior
PTOAS should emit pipe synchronization that drains all MTE-side pipes before the pto::comm::TNOTIFY(...) call. For example, the lowering of pto.comm.tnotify should produce:
// Drain prior MTE-pipe ops (loads or stores, local or remote) before issuing
// the signal write — otherwise the signal can overtake in-flight SDMA data.
pipe_barrier(PIPE_ALL);
pto::comm::TNOTIFY(<sig_view>, <value>, <notifyOp>);
Equivalently, it could emit a set_flag/wait_flag pair pinning MTE3/MTE2 against PIPE_S, or pass the relevant event handles into the TNOTIFY(...) template's variadic WaitEvents parameter.
After this, the contract "peer_TWAIT_returns ⇒ all data written/loaded before my TNOTIFY is visible" holds for every kernel that combines MTE-pipe ops with a notify, with no caller-side workaround.
Actual behavior / error logs
Emitted C++ for the two-store snippet above (illustrative; only the relevant lines shown):
// ---- first pto.tstore lowering (peer write, low half) ----
TLOAD(<tile_low>, <low_view>); // MTE2
set_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID0);
[CommRemoteOffset_f32 + addptr + GlobalTensor setup]
wait_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID0);
TSTORE(<peer_view_low>, <tile_low>); // MTE3 SDMA write #1
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
// ---- second pto.tstore lowering (peer write, high half) ----
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
TLOAD(<tile_high>, <high_view>); // MTE2
set_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID1);
[CommRemoteOffset_f32 + addptr + GlobalTensor setup]
wait_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID1);
TSTORE(<peer_view_high>, <tile_high>); // MTE3 SDMA write #2
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1); // only consumed by a *later* MTE2 op
[CommRemoteOffset_i32 + addptr + GlobalTensor setup for signal]
// ---- pto.comm.tnotify lowering ----
pto::comm::TNOTIFY(<sig_view>, 1, set); // <-- no MTE drain before this
TNOTIFY_IMPL inside the runtime (pto-isa) writes the signal before its own trailing pipe_barrier(PIPE_ALL):
// pto-isa: include/pto/comm/a2a3/TNotify.hpp
template <typename GlobalSignalData>
PTO_INTERNAL void TNOTIFY_IMPL(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op)
{
volatile __gm__ int32_t *sigPtr = (volatile __gm__ int32_t *)dstSignalData.data();
if (op == NotifyOp::AtomicAdd) {
set_st_atomic_cfg(ATOMIC_S32, ATOMIC_SUM);
detail::DcciSignal(sigPtr);
st_atomic<int32_t>(value, sigPtr); // <-- signal write, scalar pipe
detail::DcciSignal(sigPtr);
dsb(DSB_DDR);
} else {
detail::DcciSignal(sigPtr);
*sigPtr = value; // <-- signal write, scalar pipe
detail::DcciSignal(sigPtr);
dsb(DSB_DDR);
}
pipe_barrier(PIPE_ALL); // <-- too late: signal already in flight
}
Observed runtime symptom: the receiving rank's TWAIT returns, and the rank then reads its window and finds zeros (or stale values) at the bytes the prior remote TSTORE was supposed to write. With one pto.tstore + pto.comm.tnotify the race window is narrow and the data usually arrives first (the lone SDMA finishes during the scalar setup that precedes the signal write); with two back-to-back pto.tstore ops the second SDMA reliably loses and the receiver observes its bytes as zero. The symmetric cases (pto.tstore to local memory, pto.tload from local/peer) share the same hazard whenever the loaded/stored value is the thing the notify is meant to announce.
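The ordering argument can be illustrated with a toy host-side model. This is only a sketch under loud assumptions: ToyDevice, tstore, retireOne, drain, notify and runHandshake are all hypothetical names, the "MTE3 queue" is a plain deque, and the single retireOne() call stands in for "the first SDMA finishes during scalar setup". Nothing here calls or imitates a real PTO or CCE API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model only: a store that has been issued but has not landed yet.
struct PendingWrite {
    std::size_t offset;
    std::vector<float> data;
};

struct ToyDevice {
    std::vector<float> peerWindow = std::vector<float>(64, 0.0f);  // peer's dst window
    std::int32_t peerSignal = 0;                                   // peer's signal word
    std::deque<PendingWrite> mte3Queue;                            // in-flight stores

    // Issue a store: it is queued, not yet visible to the peer.
    void tstore(std::size_t offset, const std::vector<float> &data) {
        mte3Queue.push_back({offset, data});
    }
    // Let the oldest in-flight store land in the peer window.
    void retireOne() {
        if (mte3Queue.empty()) return;
        const PendingWrite w = mte3Queue.front();
        mte3Queue.pop_front();
        std::copy(w.data.begin(), w.data.end(), peerWindow.begin() + w.offset);
    }
    // Stand-in for a barrier: wait until every queued store has landed.
    void drain() {
        while (!mte3Queue.empty()) retireOne();
    }
    // The signal write is immediate; it is ordered after the queue only if we drain.
    void notify(bool drainFirst) {
        if (drainFirst) drain();
        peerSignal = 1;
    }
};

// Returns the window contents the peer reads at the moment it sees the signal.
std::vector<float> runHandshake(bool drainFirst) {
    ToyDevice dev;
    dev.tstore(0, std::vector<float>(32, 1.0f));   // low half
    dev.tstore(32, std::vector<float>(32, 2.0f));  // high half
    dev.retireOne();                               // first store lands during scalar setup
    dev.notify(drainFirst);
    std::vector<float> seen = dev.peerWindow;      // peer's read right after its wait returns
    dev.drain();                                   // the late store lands eventually, too late
    return seen;
}
```

In this model runHandshake(false) returns a window whose high half is still zero, matching the symptom above, while runHandshake(true) returns both halves populated. The model says nothing about real latencies; it only shows why a barrier placed after the signal write cannot help.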
Related: pto.comm.tput has a sibling sync gap tracked in #706 (--enable-insert-sync misses the MTE3 → MTE2 hazard between a preceding pto.tstore and the inner TLOAD issued by TPUT_IMPL). The two issues share the same root cause family — the auto-sync pass does not consider MTE-side hazards across comm-op boundaries — and likely want a unified fix.
Git commit
release v0.41
Host platform
Linux (aarch64)
Target Ascend arch (if relevant)
a3
PTOAS build level (if relevant)
level3