[Bug] No pipe sync inserted between MTE-pipe ops (pto.tstore / pto.tload, local or remote) and pto.comm.tnotify — signal can overtake in-flight data

### Component

EmitC / Codegen (lib/PTO/Transforms/PTOToEmitC.cpp)

### Description

When PTOAS lowers `pto.comm.tnotify`, it emits the `pto::comm::TNOTIFY(...)` call without first draining MTE-side pipes. Any prior op that uses MTE (any `pto.tload` or `pto.tstore`, regardless of whether the source/destination is local or peer-addressed) may still be in flight at the moment the signal write is issued. The inner signal store inside `TNOTIFY_IMPL` runs on the scalar pipe and is **not** ordered against MTE — and `TNOTIFY_IMPL`'s own trailing `pipe_barrier(PIPE_ALL)` happens *after* the signal write, so it does not save us either.

This breaks the contract that a `notify`/`wait` handshake implies "everything I issued before the notify is visible after the matching wait". All four variants are affected:

- `pto.tstore` to a peer-addressed `partition_tensor_view` (most reliably broken — remote SDMA latency is the largest)
- `pto.tstore` to a local `partition_tensor_view` (data may still be in MTE3 when signal lands; the local consumer reads stale bytes)
- `pto.tload` from a peer-addressed view (in-flight read may not be complete when caller signals downstream that "input is consumed")
- `pto.tload` from a local view (same hazard if the load result feeds a signal-driven downstream)

The fix belongs in the lowering of `pto.comm.tnotify`: drain MTE-side pipes before the call. The simplest correct form is `pipe_barrier(PIPE_ALL);` immediately before the generated `pto::comm::TNOTIFY(...)`. Equivalent forms are a `set_flag(PIPE_MTE3, PIPE_S, EID); wait_flag(PIPE_MTE3, PIPE_S, EID);` pair (plus a matching MTE2 pair if reads are in flight), or wiring the relevant event handles into the variadic `WaitEvents` parameter of `TNOTIFY(...)`. The fix must live in the compiler's lowering rather than in user code, so that callers do not need to insert sync manually.
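The hazard and the effect of draining first can be illustrated with a toy host-side model. This is only a sketch under stated assumptions: `ToyDevice`, `tstore`, `drain`, `tnotify`, and `receiverSnapshot` are hypothetical names, not PTOAS or pto-isa APIs, and the model deliberately takes the worst case in which a queued MTE3 write never lands until something drains the pipe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Toy host-side model, NOT the Ascend runtime. Every name here is hypothetical.
struct ToyDevice {
    std::vector<int32_t> peerWindow;  // memory the receiver reads after its wait
    int32_t peerSignal = 0;           // flag the receiver polls
    // Writes issued on the modelled MTE3 pipe that have not landed yet.
    std::deque<std::pair<std::size_t, int32_t>> mte3Queue;

    explicit ToyDevice(std::size_t n) : peerWindow(n, 0) {}

    // Models pto.tstore: the write is queued and lands later.
    void tstore(std::size_t offset, int32_t value) {
        mte3Queue.emplace_back(offset, value);
    }

    // Models pipe_barrier(PIPE_ALL): every queued write lands before returning.
    void drain() {
        for (const auto &w : mte3Queue) peerWindow[w.first] = w.second;
        mte3Queue.clear();
    }

    // Models the scalar-pipe signal store, which is unordered against MTE3.
    void tnotify(int32_t value, bool drainFirst) {
        if (drainFirst) drain();  // the proposed fix
        peerSignal = value;       // lands immediately
    }
};

// Returns what the receiver would read at the moment it observes the signal.
inline std::vector<int32_t> receiverSnapshot(bool drainFirst) {
    ToyDevice dev(2);
    dev.tstore(0, 11);
    dev.tstore(1, 22);
    dev.tnotify(1, drainFirst);
    assert(dev.peerSignal == 1);
    return dev.peerWindow;
}
```

In this model `receiverSnapshot(false)` yields `{0, 0}` (the signal overtook both stores) and `receiverSnapshot(true)` yields `{11, 22}`. Real hardware is nondeterministic, so the undrained case fails only some of the time there.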

### Reproduction (minimal)

Minimal PTO MLIR exhibiting the bug — **two** `pto.tstore` ops to disjoint offsets of the same peer window, followed by a single `pto.comm.tnotify` (this is the smallest pattern that reliably fails in practice; the second SDMA write is the one most likely to be in flight when the signal is issued):

```mlir
module attributes {pto.target_arch = "a2a3"} {
  func.func @repro(%arg_low:   !pto.ptr<f32>,        // local source A (inp_low)
                   %arg_high:  !pto.ptr<f32>,        // local source B (inp_high)
                   %arg_dst:   !pto.ptr<f32>,        // local dst, window-bound
                   %arg_sig:   !pto.ptr<i32>,        // local signal, window-bound
                   %peer:      i32,                   // peer rank
                   %ctx_f32:   !pto.ptr<i64>,        // comm ctx for f32
                   %ctx_i32:   !pto.ptr<i64>)        // comm ctx for i32
                   attributes {pto.kernel_kind = #pto.kernel_kind<vector>} {
    %c0_i64 = arith.constant 0  : i64
    %c0     = arith.constant 0  : index
    %c1     = arith.constant 1  : index
    %c32    = arith.constant 32 : index
    %c64    = arith.constant 64 : index
    %c1_i32 = arith.constant 1  : i32
    %peer_idx = arith.index_cast %peer : i32 to index

    // ===== First half: tile_low = TLOAD(inp_low); TSTORE peer.dst[0:32] = tile_low =====
    %tile_low = pto.alloc_tile addr = %c0_i64 valid_row = %c1 valid_col = %c32
                : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
                                v_row=?, v_col=?, blayout=row_major,
                                slayout=none_box, fractal=512, pad=0>
    %low_view = pto.make_tensor_view %arg_low,
                shape = [%c1, %c32], strides = [%c32, %c1]
                {layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
    %low_pv   = pto.partition_view %low_view,
                offsets = [%c0, %c0], sizes = [%c1, %c32]
                : !pto.tensor_view<?x?xf32>
                  -> !pto.partition_tensor_view<1x32xf32>
    pto.tload ins(%low_pv : !pto.partition_tensor_view<1x32xf32>)
              outs(%tile_low : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
                                              v_row=?, v_col=?, blayout=row_major,
                                              slayout=none_box, fractal=512, pad=0>)

    %off_a   = func.call @CommRemoteOffset_f32(%ctx_f32, %peer_idx)
               : (!pto.ptr<i64>, index) -> index
    %dst_a   = pto.addptr %arg_dst, %off_a : !pto.ptr<f32> -> !pto.ptr<f32>
    %rs      = arith.muli %c1, %c64 : index
    %view_a  = pto.make_tensor_view %dst_a,
               shape = [%c1, %c64], strides = [%rs, %c1]
               {layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
    %peer_pv_low = pto.partition_view %view_a,
                   offsets = [%c0, %c0], sizes = [%c1, %c32]
                   : !pto.tensor_view<?x?xf32>
                     -> !pto.partition_tensor_view<1x32xf32>
    pto.tstore ins(%tile_low : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
                                              v_row=?, v_col=?, blayout=row_major,
                                              slayout=none_box, fractal=512, pad=0>)
               outs(%peer_pv_low : !pto.partition_tensor_view<1x32xf32>)

    // ===== Second half: tile_high = TLOAD(inp_high); TSTORE peer.dst[32:64] = tile_high =====
    %tile_high = pto.alloc_tile addr = %c0_i64 valid_row = %c1 valid_col = %c32
                 : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
                                 v_row=?, v_col=?, blayout=row_major,
                                 slayout=none_box, fractal=512, pad=0>
    %high_view = pto.make_tensor_view %arg_high,
                 shape = [%c1, %c32], strides = [%c32, %c1]
                 {layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
    %high_pv   = pto.partition_view %high_view,
                 offsets = [%c0, %c0], sizes = [%c1, %c32]
                 : !pto.tensor_view<?x?xf32>
                   -> !pto.partition_tensor_view<1x32xf32>
    pto.tload ins(%high_pv : !pto.partition_tensor_view<1x32xf32>)
              outs(%tile_high : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
                                               v_row=?, v_col=?, blayout=row_major,
                                               slayout=none_box, fractal=512, pad=0>)

    %off_b   = func.call @CommRemoteOffset_f32(%ctx_f32, %peer_idx)
               : (!pto.ptr<i64>, index) -> index
    %dst_b   = pto.addptr %arg_dst, %off_b : !pto.ptr<f32> -> !pto.ptr<f32>
    %rs2     = arith.muli %c1, %c64 : index
    %view_b  = pto.make_tensor_view %dst_b,
               shape = [%c1, %c64], strides = [%rs2, %c1]
               {layout = #pto.layout<nd>} : !pto.tensor_view<?x?xf32>
    %peer_pv_high = pto.partition_view %view_b,
                    offsets = [%c0, %c32], sizes = [%c1, %c32]
                    : !pto.tensor_view<?x?xf32>
                      -> !pto.partition_tensor_view<1x32xf32>
    pto.tstore ins(%tile_high : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
                                                v_row=?, v_col=?, blayout=row_major,
                                                slayout=none_box, fractal=512, pad=0>)
               outs(%peer_pv_high : !pto.partition_tensor_view<1x32xf32>)

    // ===== Notify peer that both halves are done =====
    %sig_off   = func.call @CommRemoteOffset_i32(%ctx_i32, %peer_idx)
                 : (!pto.ptr<i64>, index) -> index
    %sig_ptr   = pto.addptr %arg_sig, %sig_off : !pto.ptr<i32> -> !pto.ptr<i32>
    %sig_view  = pto.make_tensor_view %sig_ptr,
                 shape = [%c1, %c1], strides = [%c1, %c1]
                 {layout = #pto.layout<nd>} : !pto.tensor_view<?x?xi32>
    %sig_pv    = pto.partition_view %sig_view,
                 offsets = [%c0, %c0], sizes = [%c1, %c1]
                 : !pto.tensor_view<?x?xi32>
                   -> !pto.partition_tensor_view<1x1xi32>
    pto.comm.tnotify(%sig_pv, %c1_i32 : !pto.partition_tensor_view<1x1xi32>, i32)
                     {notifyOp = #pto<notify_op set>}
    return
  }
}
```

Lower with `ptoas` for the a2a3 backend (any vector-kernel invocation that goes through `PTOToEmitC`) and inspect the emitted `.cpp`. Two `TSTORE` calls appear back-to-back, followed by `pto::comm::TNOTIFY` with no MTE drain — exactly the race window.

The same defect surfaces with the symmetric pattern `pto.tload` (local or peer) before `pto.comm.tnotify`. The remote-store-with-two-stores variant is just the easiest to observe failing.
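Inspecting the emitted `.cpp` by eye gets tedious across many kernels, so a small text scan may help as a triage aid. The helper below is a hypothetical sketch, not part of PTOAS: `findUndrainedNotifies` is an invented name, it matches call names by plain substring, and it only recognizes the `pipe_barrier(PIPE_ALL)` form of the drain (it would report false positives for code that syncs through `set_flag`/`wait_flag` or `WaitEvents`).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical triage helper: returns the 1-based line numbers of TNOTIFY calls
// that follow a TLOAD/TSTORE with no pipe_barrier(PIPE_ALL) in between.
inline std::vector<int> findUndrainedNotifies(const std::string &emittedCpp) {
    std::vector<int> hits;
    std::istringstream in(emittedCpp);
    std::string line;
    bool mtePending = false;  // true once an MTE op has been seen and not drained
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        // Ignore trailing "//" comments so annotations do not trigger matches.
        const std::string code = line.substr(0, line.find("//"));
        if (code.find("pipe_barrier(PIPE_ALL)") != std::string::npos) {
            mtePending = false;
        }
        if (code.find("TLOAD(") != std::string::npos ||
            code.find("TSTORE(") != std::string::npos) {
            mtePending = true;
        }
        if (code.find("TNOTIFY(") != std::string::npos) {
            if (mtePending) hits.push_back(lineNo);
            // TNOTIFY_IMPL ends with its own pipe_barrier(PIPE_ALL), so pipes
            // are treated as drained after the call returns.
            mtePending = false;
        }
    }
    return hits;
}
```

Feeding it the text of an emitted kernel returns an empty vector when every `TNOTIFY` is preceded by a full barrier, and the offending line numbers otherwise.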

### Expected behavior

PTOAS should emit pipe synchronization that drains all MTE-side pipes before the `pto::comm::TNOTIFY(...)` call. For example, the lowering of `pto.comm.tnotify` should produce:

```cpp
// Drain prior MTE-pipe ops (loads or stores, local or remote) before issuing
// the signal write — otherwise the signal can overtake in-flight SDMA data.
pipe_barrier(PIPE_ALL);
pto::comm::TNOTIFY(<sig_view>, <value>, <notifyOp>);
```

Equivalently, the lowering could emit a `set_flag`/`wait_flag` pair ordering MTE3 (and MTE2) against `PIPE_S`, or pass the relevant event handles into the `TNOTIFY(...)` template's variadic `WaitEvents` parameter.

After this, the contract `peer_TWAIT_returns ⇒ all data written/loaded before my TNOTIFY is visible` holds for every kernel that combines MTE-pipe ops with a notify, with no caller-side workaround.
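For readers more familiar with CPU memory models, the contract is loosely analogous to a release/acquire handshake in standard C++. The sketch below is only an analogy under that assumption: it uses `std::atomic` and `std::thread` on the host, `handshakeOnce` is a hypothetical name, and nothing in it corresponds to actual Ascend pipes or pto-isa calls.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Host-side analogy only: payload stands in for the peer window and
// signal stands in for the notify flag.
inline int handshakeOnce() {
    int payload = 0;
    std::atomic<int> signal{0};
    int seen = -1;

    std::thread consumer([&] {
        // Analogue of TWAIT: spin until the flag is observed with acquire ordering.
        while (signal.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        seen = payload;  // guaranteed to observe the write made before the release
    });

    payload = 42;                                 // analogue of the data store
    signal.store(1, std::memory_order_release);   // analogue of a correctly ordered notify
    consumer.join();
    return seen;
}
```

The release store is what orders the payload write ahead of the flag. The bug described here corresponds to publishing the flag without that ordering, which is why the consumer could observe the flag yet still read the old payload.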

### Actual behavior / error logs

Emitted C++ for the two-store snippet above (illustrative; only the relevant lines shown):

```text
// ---- first pto.tstore lowering (peer write, low half) ----
TLOAD(<tile_low>, <low_view>);                       // MTE2
set_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID0);
[CommRemoteOffset_f32 + addptr + GlobalTensor setup]
wait_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID0);
TSTORE(<peer_view_low>, <tile_low>);                 // MTE3 SDMA write #1
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);

// ---- second pto.tstore lowering (peer write, high half) ----
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
TLOAD(<tile_high>, <high_view>);                     // MTE2
set_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID1);
[CommRemoteOffset_f32 + addptr + GlobalTensor setup]
wait_flag(PIPE_MTE2, PIPE_MTE3, EVENT_ID1);
TSTORE(<peer_view_high>, <tile_high>);               // MTE3 SDMA write #2
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);           // only consumed by a *later* MTE2 op
[CommRemoteOffset_i32 + addptr + GlobalTensor setup for signal]

// ---- pto.comm.tnotify lowering ----
pto::comm::TNOTIFY(<sig_view>, 1, set);              // <-- no MTE drain before this
```

`TNOTIFY_IMPL` inside the runtime (pto-isa) writes the signal before its own trailing `pipe_barrier(PIPE_ALL)`:

```cpp
// pto-isa: include/pto/comm/a2a3/TNotify.hpp
template <typename GlobalSignalData>
PTO_INTERNAL void TNOTIFY_IMPL(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op)
{
    volatile __gm__ int32_t *sigPtr = (volatile __gm__ int32_t *)dstSignalData.data();
    if (op == NotifyOp::AtomicAdd) {
        set_st_atomic_cfg(ATOMIC_S32, ATOMIC_SUM);
        detail::DcciSignal(sigPtr);
        st_atomic<int32_t>(value, sigPtr);    // <-- signal write, scalar pipe
        detail::DcciSignal(sigPtr);
        dsb(DSB_DDR);
    } else {
        detail::DcciSignal(sigPtr);
        *sigPtr = value;                       // <-- signal write, scalar pipe
        detail::DcciSignal(sigPtr);
        dsb(DSB_DDR);
    }
    pipe_barrier(PIPE_ALL);                    // <-- too late: signal already in flight
}
```

Observed runtime symptom: the receiving rank's `TWAIT` returns, the rank then reads its window, and it finds zeros (or stale values) at the bytes the prior remote `TSTORE` was supposed to write. With **one** `pto.tstore` + `pto.comm.tnotify` the race window is small and the data usually arrives first (the lone SDMA finishes during the scalar setup that precedes the signal write); with **two** back-to-back `pto.tstore` ops the second SDMA reliably loses the race and the receiver observes its bytes as zero. The symmetric cases (`pto.tstore` to local memory, `pto.tload` from a local or peer view) share the same hazard whenever the loaded or stored value is what the notify is meant to announce.

Related: `pto.comm.tput` has a sibling sync gap tracked in #706 (`--enable-insert-sync` misses the `MTE3 → MTE2` hazard between a preceding `pto.tstore` and the inner `TLOAD` issued by `TPUT_IMPL`). The two issues share the same root cause family — the auto-sync pass does not consider MTE-side hazards across comm-op boundaries — and likely want a unified fix.

### Git commit

release v0.41

### Host platform

Linux (aarch64)

### Target Ascend arch (if relevant)

a3

### PTOAS build level (if relevant)

level3

