[Bug] insert-sync: MTE3->MTE2 pipe flag for same-address GM store->load round-trip doesn't guarantee GM visibility — intermittent stale read / MTE hang (a2a3, vector spmd) #730

@zhangqi-chen

Summary

insert-sync guards a same-address GM store→load round-trip inside one vector kernel with only an MTE3→MTE2 pipe flag. The pipe flag orders the pipes but does not guarantee the GM write is globally visible/committed before the MTE2 read of the same address. The reader (TLOAD) intermittently observes stale / in-flight data, and occasionally the concurrent same-GM-line read/write hangs the MTE engine (the kernel sits in RUNNING forever → AICPU 800k-idle TIMEOUT_EXIT → host 507046).

The defining evidence that this is a sync/visibility bug and not codegen: re-running the exact same compiled binary (same .o/.so) gives three different outcomes — PASS, wrong-values, and on-core hang — i.e. it is timing/visibility dependent, not deterministic.

This is in the same family as #696 (missing writer-side dcci+dsb after GM stores) and #706 / #711 (missing MTE3→MTE2 hazards around tput/tnotify).

Where it happens (pto level)

Kernel rmsnorm_rope, pto.kernel_kind = vector, target a2a3, launched as spmd (block_num=4). The .pto (insert-sync input) has a write-then-read on the same tensor normed_kv across an SSA-view boundary:

// inside scf.for %k0 (last iter writes cols 448..512):
%normed_kv...__iter_v1_pview = pto.partition_view %normed_kv...__ssa_v0_view, offsets = [%20, %k0], sizes = [16, 64]
pto.tstore ins(%normed_chunk...) outs(%normed_kv...__iter_v1_pview)          // GM store, normed_kv[bb, k0:k0+64]
}
// immediately after the loop:
%normed_kv...__rv_v2_pview = pto.partition_view %normed_kv...__ssa_v0_view, offsets = [%20, %c448], sizes = [16, 64]
pto.tload ins(%normed_kv...__rv_v2_pview) outs(%kv_rope_slice...)            // GM load, normed_kv[bb, 448:512]  <-- reads exactly what the last store wrote

The store target (__iter_v1) and load source (__rv_v2) are different SSA views of the same underlying normed_kv tensor and the same GM address [bb, 448:512].
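To make the aliasing concrete, here is a minimal host-side C++ sketch that checks whether two 2-D partition views of the same tensor cover overlapping elements. `View2D`, `views_overlap`, and `first_conflicting_k0` are hypothetical names and are not part of ptoas. The `[16, 64]` tile shape comes from the .pto above; the loop bound and step passed in by the caller are assumptions. Comparing index ranges like this is only equivalent to comparing addresses when both views share the same base and row stride and neither view runs past the row width.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a pto.partition_view: offsets and sizes in elements.
struct View2D {
    std::int64_t row_off, col_off;
    std::int64_t rows, cols;
};

// Half-open interval overlap test: [a, a+len_a) against [b, b+len_b).
static bool ranges_overlap(std::int64_t a, std::int64_t len_a,
                           std::int64_t b, std::int64_t len_b) {
    return a < b + len_b && b < a + len_a;
}

// Two views of the same base tensor overlap if both their row ranges
// and their column ranges intersect.
bool views_overlap(const View2D& x, const View2D& y) {
    assert(x.rows > 0 && x.cols > 0 && y.rows > 0 && y.cols > 0);
    return ranges_overlap(x.row_off, x.rows, y.row_off, y.rows) &&
           ranges_overlap(x.col_off, x.cols, y.col_off, y.cols);
}

// Walks the store loop (k0 = 0, k_step, ... < k_end) and returns the k0 of the
// first [16, 64] store tile that overlaps the load view, or -1 if none does.
std::int64_t first_conflicting_k0(std::int64_t bb, std::int64_t k_end,
                                  std::int64_t k_step, const View2D& load) {
    for (std::int64_t k0 = 0; k0 < k_end; k0 += k_step) {
        const View2D store{bb, k0, 16, 64};
        if (views_overlap(store, load)) {
            return k0;
        }
    }
    return -1;
}
```

With a load view at column offset 448 and an assumed loop of `k0 = 0..512` in steps of 64, only the last store tile conflicts, which matches the hazard described above.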

What insert-sync emitted (generated cpp)

// loop body (last iteration), store:
    wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
    TSTORE(v120, v116);                              // normed_kv[bb, k0:k0+64]
    set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);       // loop-carried; last set has no in-loop consumer
  }
  set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
  ... v125 = GlobalTensor(normed_kv + bb*512 + 448 ...)
  wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);        // <-- intended store->load barrier
  TLOAD(v121, v125);                                 // normed_kv[bb, 448:512]
  set_flag(PIPE_MTE2, PIPE_V, EVENT_ID3);
  wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID3);
  TGATHER<...P0101>(v126, v121);

So a single MTE3→MTE2 EVENT_ID1 flag is the only thing guarding the round-trip. On a2a3 this pipe flag is insufficient for a GM store→load of the same address: it fires on MTE3 instruction retire, not on GM write commit/visibility. A writer-side cache maintenance / memory barrier (e.g. dcci+dsb, cf. #696) — or otherwise a guarantee that the store is GM-visible before the dependent MTE2 load issues — appears to be missing.
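The hypothesis in the paragraph above can be illustrated with a deliberately simplified host-side model. This is a sketch of the suspected mechanism, not a description of documented a2a3 behaviour: `ToyGm`, `store`, `load`, `commit_all`, and `round_trip` are hypothetical names, and the model simply assumes that a store can retire (and so signal its pipe flag) before its data lands in GM.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Toy model: GM plus a queue of stores that have retired but not yet committed.
class ToyGm {
public:
    explicit ToyGm(std::size_t n) : mem_(n, 0) {}

    // The store "retires" immediately (a pipe flag could be set at this point),
    // but its data only sits in the pending queue.
    void store(std::size_t addr, int value) {
        assert(addr < mem_.size());
        pending_.emplace_back(addr, value);
    }

    // Stand-in for a writer-side barrier: drain every pending store into GM.
    void commit_all() {
        for (const auto& p : pending_) {
            mem_[p.first] = p.second;
        }
        pending_.clear();
    }

    // The load reads GM directly and does not see pending stores.
    int load(std::size_t addr) const {
        assert(addr < mem_.size());
        return mem_[addr];
    }

private:
    std::vector<int> mem_;
    std::deque<std::pair<std::size_t, int>> pending_;
};

// Store then load the same address, optionally with the barrier in between.
int round_trip(bool with_barrier) {
    ToyGm gm(512);
    gm.store(448, 7);        // models TSTORE retiring and the flag being set
    if (with_barrier) {
        gm.commit_all();     // models a writer-side commit before the load
    }
    return gm.load(448);     // models TLOAD issued after wait_flag
}
```

In this model the load without the barrier always sees the old value. On real hardware the outcome would depend on timing, which would be consistent with the same binary producing different results across runs.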

(Side observation, possibly unrelated: the loop-carried set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0) at the bottom of the store loop has no matching wait consuming the final iteration's set inside the loop; it is only drained in the epilogue.)
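One way to check this kind of observation mechanically is to replay a linearized trace of flag operations and report any (source pipe, destination pipe, event ID) triple whose sets and waits do not balance. The sketch below is a host-side helper under that assumption; `FlagEvent`, `FlagKey`, and `outstanding_flags` are hypothetical names, and the trace would have to be extracted from the generated cpp by some other means.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <vector>

enum class Op { Set, Wait };

// One set_flag / wait_flag call in program order.
struct FlagEvent {
    Op op;
    std::string src_pipe;
    std::string dst_pipe;
    int event_id;
};

using FlagKey = std::tuple<std::string, std::string, int>;

// Returns the net count (sets minus waits) for every flag whose count is
// nonzero after replaying the trace. A positive value means sets that were
// never consumed within the trace; a negative value means extra waits.
std::map<FlagKey, int> outstanding_flags(const std::vector<FlagEvent>& trace) {
    std::map<FlagKey, int> balance;
    for (const FlagEvent& e : trace) {
        const FlagKey key{e.src_pipe, e.dst_pipe, e.event_id};
        balance[key] += (e.op == Op::Set) ? 1 : -1;
    }
    // Drop balanced flags so only the suspicious ones remain.
    for (auto it = balance.begin(); it != balance.end();) {
        if (it->second == 0) {
            it = balance.erase(it);
        } else {
            ++it;
        }
    }
    return balance;
}
```

Running the helper on the loop body alone and then on loop plus epilogue would show whether a final-iteration set is only consumed outside the loop. It checks counts only, not whether each wait is ordered after its matching set.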

Runtime evidence (a2a3 silicon)

Same binary, three runs:

  • PASS
  • wrong-values: kv/cmp_kv_cache ~3.96% mismatches, all where golden expects 0 — i.e. the round-trip TLOAD returned stale/garbage that then propagated through the gather/RoPE/scatter.
  • hang: scheduler stall diag pins the hung task to this exact kernel:
    TASK ring=1 ... state=RUNNING fanin_refcount=3/3 kernels=[aic:-1 aiv0:3 aiv1:-1] running_on=[cores=[core=26(aiv0) core=32(aiv0)]]
    SUMMARY completed=4/7 ... scan_running=1
    ... handle_timeout_exit TIMEOUT_EXIT after_idle_iterations=800000
    
    aiv0:3 is rmsnorm_rope; its fan-in is fully satisfied (3/3) and it is dispatched/RUNNING but never reports completion → host RuntimeError: run_prepared failed with code 507046.
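For the wrong-values case, a figure like "~3.96% mismatches, all where golden expects 0" can be computed with a comparison along the following lines. This is a minimal sketch: `MismatchStats`, `compare_to_golden`, and `mismatch_percent` are hypothetical names, and the absolute/relative tolerance check is an assumption, since the issue does not say how the comparison was done.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MismatchStats {
    std::size_t total = 0;        // elements compared
    std::size_t mismatches = 0;   // elements outside tolerance
    std::size_t golden_zero = 0;  // mismatches where golden is exactly 0
};

// Element-wise comparison with an assumed |out - golden| <= atol + rtol*|golden| rule.
MismatchStats compare_to_golden(const std::vector<float>& out,
                                const std::vector<float>& golden,
                                float atol, float rtol) {
    assert(out.size() == golden.size());
    MismatchStats s;
    s.total = out.size();
    for (std::size_t i = 0; i < out.size(); ++i) {
        const float tol = atol + rtol * std::fabs(golden[i]);
        // Written as a negated <= so that NaN in the output counts as a mismatch.
        const bool bad = !(std::fabs(out[i] - golden[i]) <= tol);
        if (bad) {
            ++s.mismatches;
            if (golden[i] == 0.0f) {
                ++s.golden_zero;
            }
        }
    }
    return s;
}

// Mismatch rate as a percentage of all compared elements.
double mismatch_percent(const MismatchStats& s) {
    return s.total == 0 ? 0.0 : 100.0 * static_cast<double>(s.mismatches) /
                                    static_cast<double>(s.total);
}
```

If `golden_zero` equals `mismatches`, every bad element sits where the reference expects 0, which is the pattern reported above.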

Confirmation that the round-trip is the cause

Rewriting the source so that the dependent slice stays in local memory (UB), by recomputing the RoPE input tile instead of reading normed_kv back from GM, removes the GM store→load round-trip and makes the kernel pass 5/5 fresh-compile runs (both the stall and the wrong values are gone). No other change was made.

Environment

Component   Version
ptoas       v0.43
pypto       a861ec71
pto-isa     6d785d03
target      a2a3, kernel_kind=vector, spmd block_num=4

Attachments

rmsnorm_rope.pto (insert-sync input) and the generated rmsnorm_rope.cpp will be attached in a comment.

Labels: bug (Something isn't working)
