[Bug] insert-sync: MTE3->MTE2 pipe flag for same-address GM store->load round-trip doesn't guarantee GM visibility: intermittent stale read / MTE hang (a2a3, vector spmd) #730

### Summary

`insert-sync` guards a **same-address GM store→load round-trip** inside one `vector` kernel with only an `MTE3→MTE2` pipe flag. The pipe flag orders the pipes but does **not** guarantee the GM write is globally visible/committed before the `MTE2` read of the same address. The reader (`TLOAD`) intermittently observes stale / in-flight data, and occasionally the concurrent same-GM-line read/write **hangs the MTE engine** (the kernel sits in `RUNNING` forever → AICPU 800k-idle `TIMEOUT_EXIT` → host `507046`).

The strongest evidence that this is a sync/visibility bug rather than a codegen bug: **re-running the exact same compiled binary (same `.o`/`.so`) gives three different outcomes** (PASS, wrong values, and on-core hang). The failure is therefore timing/visibility dependent, not deterministic.

This is in the same family as #696 (missing writer-side `dcci`+`dsb` after GM stores) and #706 / #711 (missing `MTE3→MTE2` hazards around tput/tnotify).

### Where it happens (pto level)

Kernel `rmsnorm_rope`, `pto.kernel_kind = vector`, target `a2a3`, launched as spmd (`block_num=4`). The `.pto` (insert-sync **input**) has a write-then-read on the same tensor `normed_kv` across an SSA-view boundary:

```mlir
// inside scf.for %k0 (last iter writes cols 448..512):
%normed_kv...__iter_v1_pview = pto.partition_view %normed_kv...__ssa_v0_view, offsets = [%20, %k0], sizes = [16, 64]
pto.tstore ins(%normed_chunk...) outs(%normed_kv...__iter_v1_pview)          // GM store, normed_kv[bb, k0:k0+64]
}
// immediately after the loop:
%normed_kv...__rv_v2_pview = pto.partition_view %normed_kv...__ssa_v0_view, offsets = [%20, %c448], sizes = [16, 64]
pto.tload ins(%normed_kv...__rv_v2_pview) outs(%kv_rope_slice...)            // GM load, normed_kv[bb, 448:512]  <-- reads exactly what the last store wrote
```

The store target (`__iter_v1`) and load source (`__rv_v2`) are different SSA views of the **same** underlying `normed_kv` tensor and the **same** GM address `[bb, 448:512]`.
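
To make the aliasing concrete, here is a minimal host-side sketch of a may-alias check over two partition views. It is an illustration only, not ptoas code: `PartitionView`, `may_alias`, and the constants are hypothetical names, `kRow` is a placeholder for `%20` (its value is not shown in this report), and the `k0 = 384` view assumes the loop steps by 64 columns.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a pto.partition_view: a 2-D window
// into a root tensor identified by an integer id. Not a real ptoas type.
struct PartitionView {
    int root_tensor;               // underlying GM tensor the view derives from
    std::int64_t row_off, col_off; // offsets = [row, col]
    std::int64_t rows, cols;       // sizes   = [rows, cols]
};

// Half-open interval overlap: [a, a+n) intersects [b, b+m).
constexpr bool ranges_overlap(std::int64_t a, std::int64_t n,
                              std::int64_t b, std::int64_t m) {
    return a < b + m && b < a + n;
}

// Two views may alias when they share a root tensor and overlap in both dims.
constexpr bool may_alias(const PartitionView& x, const PartitionView& y) {
    return x.root_tensor == y.root_tensor &&
           ranges_overlap(x.row_off, x.rows, y.row_off, y.rows) &&
           ranges_overlap(x.col_off, x.cols, y.col_off, y.cols);
}

constexpr std::int64_t kRow = 32;  // placeholder for %20 (value not in the report)
constexpr PartitionView kLastStore{0, kRow, 448, 16, 64};  // __iter_v1, k0 = 448
constexpr PartitionView kPrevStore{0, kRow, 384, 16, 64};  // assumed k0 = 384
constexpr PartitionView kReload{0, kRow, 448, 16, 64};     // __rv_v2

// Runtime spot check mirroring the offsets quoted in the .pto snippet.
inline void check_report_views() {
    assert(may_alias(kLastStore, kReload));   // last store vs. reload: hazard
    assert(!may_alias(kPrevStore, kReload));  // earlier columns do not overlap
}
```

Under these assumptions only the final loop iteration's store collides with the post-loop load, which matches the observation that the hazard sits exactly at the loop exit.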

### What insert-sync emitted (generated cpp)

```cpp
// loop body (last iteration), store:
    wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
    TSTORE(v120, v116);                              // normed_kv[bb, k0:k0+64]
    set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);       // loop-carried; last set has no in-loop consumer
  }
  set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
  ... v125 = GlobalTensor(normed_kv + bb*512 + 448 ...)
  wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);        // <-- intended store->load barrier
  TLOAD(v121, v125);                                 // normed_kv[bb, 448:512]
  set_flag(PIPE_MTE2, PIPE_V, EVENT_ID3);
  wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID3);
  TGATHER<...P0101>(v126, v121);
```

So a single `MTE3→MTE2 EVENT_ID1` flag is the only thing guarding the round-trip. On `a2a3` this pipe flag is **insufficient** for a GM store→load of the same address: it fires on `MTE3` instruction retire, not on GM write commit/visibility. A writer-side cache maintenance / memory barrier (e.g. `dcci`+`dsb`, cf. #696) — or otherwise a guarantee that the store is GM-visible before the dependent `MTE2` load issues — appears to be missing.
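
The suspected mechanism can be pictured with a toy timing model. This is a sketch under stated assumptions, not a description of measured a2a3 behavior: ticks are abstract, `StoreTiming`, `FlagTrigger`, and both functions are hypothetical names, and the model simply assumes a store becomes GM-visible some non-negative delay after it retires.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: times are abstract ticks, not measured a2a3 latencies.
struct StoreTiming {
    std::int64_t retire_tick;   // MTE3 considers the TSTORE instruction done
    std::int64_t commit_delay;  // extra ticks until the data is visible in GM
};

// When the MTE3 -> MTE2 flag is assumed to fire.
enum class FlagTrigger { OnRetire, OnCommit };

// Tick at which wait_flag(MTE3 -> MTE2) would release the dependent TLOAD.
inline std::int64_t load_issue_tick(const StoreTiming& s, FlagTrigger trig) {
    assert(s.commit_delay >= 0);
    return trig == FlagTrigger::OnRetire ? s.retire_tick
                                         : s.retire_tick + s.commit_delay;
}

// The load observes stale data if it issues before the store is GM-visible.
inline bool load_sees_stale(const StoreTiming& s, FlagTrigger trig) {
    return load_issue_tick(s, trig) < s.retire_tick + s.commit_delay;
}
```

In this model a retire-triggered flag is correct whenever the commit delay happens to be zero and stale otherwise, which is one way the same binary could pass on some runs and fail on others. A commit-triggered release (or an equivalent writer-side barrier) never reads stale data in the model.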

(Side observation, possibly unrelated: the loop-carried `set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0)` at the bottom of the store loop has no matching `wait` consuming the final iteration's set inside the loop; it is only drained in the epilogue.)

### Runtime evidence (a2a3 silicon)

Same binary, three runs:

- **PASS**
- **wrong-values**: `kv`/`cmp_kv_cache` ~3.96% mismatches, all where golden expects 0 — i.e. the round-trip TLOAD returned stale/garbage that then propagated through the gather/RoPE/scatter.
- **hang**: scheduler stall diag pins the hung task to this exact kernel:
  ```
  TASK ring=1 ... state=RUNNING fanin_refcount=3/3 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[cores=[core=26(aiv0) core=32(aiv0)]]
  SUMMARY completed=4/7 ... scan_running=1
  ... handle_timeout_exit TIMEOUT_EXIT after_idle_iterations=800000
  ```
  `aiv0:3` is `rmsnorm_rope`; its fan-in is fully satisfied (3/3) and it is dispatched/RUNNING but never reports completion → host `RuntimeError: run_prepared failed with code 507046`.
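
For anyone reproducing the wrong-values case, a host-side comparison along the following lines can report both the mismatch rate and how many mismatches fall where golden is 0. It is a hedged sketch, not the checker used for the numbers above: `MismatchStats`, `compare_to_golden`, `mismatch_percent`, and the tolerance parameter are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MismatchStats {
    std::size_t total = 0;        // elements compared
    std::size_t mismatches = 0;   // elements outside the tolerance
    std::size_t golden_zero = 0;  // mismatches where golden == 0
};

// Element-wise comparison of a kernel output against its golden reference.
inline MismatchStats compare_to_golden(const std::vector<float>& actual,
                                       const std::vector<float>& golden,
                                       float tol) {
    assert(actual.size() == golden.size());
    MismatchStats s;
    s.total = golden.size();
    for (std::size_t i = 0; i < golden.size(); ++i) {
        // NaN never compares <= tol, so garbage NaNs count as mismatches.
        if (!(std::fabs(actual[i] - golden[i]) <= tol)) {
            ++s.mismatches;
            if (golden[i] == 0.0f) ++s.golden_zero;
        }
    }
    return s;
}

// Mismatch rate as a percentage of all compared elements.
inline double mismatch_percent(const MismatchStats& s) {
    return s.total == 0 ? 0.0
                        : 100.0 * static_cast<double>(s.mismatches) /
                              static_cast<double>(s.total);
}
```

If `golden_zero == mismatches`, every bad element sits where the reference expects 0, which is the pattern reported for `kv`/`cmp_kv_cache`.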

### Confirmation that the round-trip is the cause

Rewriting the source so the dependent slice stays in local memory (UB), recomputing the RoPE input tile instead of reading `normed_kv` back from GM, removes the GM store→load round-trip and makes the kernel pass **5/5** fresh-compile runs (stall and wrong-values both gone). No other change was made.

### Environment

| Component | Version |
|---|---|
| ptoas | v0.43 |
| pypto | a861ec71 |
| pto-isa | 6d785d03 |
| target | a2a3, kernel_kind=vector, spmd block_num=4 |

### Attachments

`rmsnorm_rope.pto` (insert-sync input) and the generated `rmsnorm_rope.cpp` will be attached in a comment.

