Skip to content

[AMDGPU] Implement "non-av" semantics using metadata#199489

Open
ssahasra wants to merge 10 commits into
mainfrom
users/ssahasra/av-metadata
Open

[AMDGPU] Implement "non-av" semantics using metadata#199489
ssahasra wants to merge 10 commits into
mainfrom
users/ssahasra/av-metadata

Conversation

@ssahasra
Copy link
Copy Markdown
Contributor

@ssahasra ssahasra commented May 25, 2026

When the MMRA tag !{!"amdgcn-av", !"none"} is present on a synchronization operation (fence, atomic load/store/rmw/cmpxchg), suppress cache writeback (MakeAvailable) and cache invalidation (MakeVisible) while preserving memory ordering (waits). This implements the metadata proposed in #191246.

Part of a stack:

Fixes: LCOMPILER-2214

Assisted-By: Claude Opus 4.6

ssahasra added 2 commits May 25, 2026 12:04
A release consists of two actions: write-back the current cache, and wait for
"relevant" outstanding operations to complete. With the new memory model, it is
possible to disable the cache write-back using "av none" semantics. This patch
cleanly separates the existing implementation so that the write-backs can be
selectively applied when such metadata is present.

Assisted-By: Claude Opus 4.6
When the MMRA tag !{!"amdgcn-av", !"none"} is present on a synchronization
operation (fence, atomic load/store/rmw/cmpxchg), suppress cache writeback
(MakeAvailable) and cache invalidation (MakeVisible) while preserving
memory ordering (waits).

This implements the metadata proposed in #191246.

Fixes: LCOMPILER-2214

Assisted-By: Claude Opus 4.6
@llvmorg-github-actions
Copy link
Copy Markdown

@llvm/pr-subscribers-backend-amdgpu

Author: Sameer Sahasrabuddhe (ssahasra)

Changes

When the MMRA tag !{!"amdgcn-av", !"none"} is present on a synchronization operation (fence, atomic load/store/rmw/cmpxchg), suppress cache writeback (MakeAvailable) and cache invalidation (MakeVisible) while preserving memory ordering (waits).

This implements the metadata proposed in #191246.

Fixes: LCOMPILER-2214

Assisted-By: Claude Opus 4.6


Patch is 39.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199489.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+47-29)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll (+722)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index b721dcaf49d0f..f16192d343531 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -148,6 +148,7 @@ class SIMemOpInfo final {
   bool IsNonTemporal = false;
   bool IsLastUse = false;
   bool IsCooperative = false;
+  bool IsAVNone = false;
 
   // TODO: Should we assume Cooperative=true if no MMO is present?
   SIMemOpInfo(
@@ -160,12 +161,12 @@ class SIMemOpInfo final {
       AtomicOrdering FailureOrdering = AtomicOrdering::SequentiallyConsistent,
       bool IsVolatile = false, bool IsNonTemporal = false,
       bool IsLastUse = false, bool IsCooperative = false,
-      bool CanDemoteWorkgroupToWavefront = false)
+      bool CanDemoteWorkgroupToWavefront = false, bool IsAVNone = false)
       : Ordering(Ordering), FailureOrdering(FailureOrdering), Scope(Scope),
         OrderingAddrSpace(OrderingAddrSpace), InstrAddrSpace(InstrAddrSpace),
         IsCrossAddressSpaceOrdering(IsCrossAddressSpaceOrdering),
         IsVolatile(IsVolatile), IsNonTemporal(IsNonTemporal),
-        IsLastUse(IsLastUse), IsCooperative(IsCooperative) {
+        IsLastUse(IsLastUse), IsCooperative(IsCooperative), IsAVNone(IsAVNone) {
 
     if (Ordering == AtomicOrdering::NotAtomic) {
       assert(!IsCooperative && "Cannot be cooperative & non-atomic!");
@@ -277,6 +278,9 @@ class SIMemOpInfo final {
   /// \returns True if this is a cooperative load or store atomic.
   bool isCooperative() const { return IsCooperative; }
 
+  /// \returns True if MakeAvailable/MakeVisible should be suppressed.
+  bool isAVNone() const { return IsAVNone; }
+
   /// \returns True if ordering constraint of the machine instruction used to
   /// create this SIMemOpInfo is unordered or higher, false otherwise.
   bool isAtomic() const {
@@ -451,13 +455,13 @@ class SICacheControl {
                                SIAtomicScope Scope, SIAtomicAddrSpace AddrSpace,
                                Position Pos) const = 0;
 
-  /// Inserts writeback followed by an unconditional wait to implement a
-  /// release operation.
+  /// Inserts writeback (unless \p IsAVNone) followed by an unconditional wait.
   bool insertRelease(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
                      SIAtomicAddrSpace AddrSpace, bool IsCrossAddrSpaceOrdering,
-                     Position Pos) const {
+                     Position Pos, bool IsAVNone) const {
     bool Changed = false;
-    Changed |= insertWriteback(MI, Scope, AddrSpace, Pos);
+    if (!IsAVNone)
+      Changed |= insertWriteback(MI, Scope, AddrSpace, Pos);
     Changed |= insertWait(MI, Scope, AddrSpace, SIMemOp::LOAD | SIMemOp::STORE,
                           IsCrossAddrSpaceOrdering, Pos,
                           AtomicOrdering::Release, /*AtomicsOnly=*/false);
@@ -733,6 +737,13 @@ getSynchronizeAddrSpaceMD(const MachineInstr &MI) {
   return Result;
 }
 
+static bool hasAVNoneMMRA(const MachineInstr &MI) {
+  auto MMRA = MMRAMetadata(MI.getMMRAMetadata());
+  if (!MMRA)
+    return false;
+  return MMRA.hasTag("amdgcn-av", "none");
+}
+
 } // end anonymous namespace
 
 void SIMemOpAccess::reportUnsupported(const MachineBasicBlock::iterator &MI,
@@ -876,7 +887,7 @@ std::optional<SIMemOpInfo> SIMemOpAccess::constructFromMIWithMMO(
   return SIMemOpInfo(ST, Ordering, Scope, OrderingAddrSpace, InstrAddrSpace,
                      IsCrossAddressSpaceOrdering, FailureOrdering, IsVolatile,
                      IsNonTemporal, IsLastUse, IsCooperative,
-                     CanDemoteWorkgroupToWavefront);
+                     CanDemoteWorkgroupToWavefront, hasAVNoneMMRA(*MI));
 }
 
 std::optional<SIMemOpInfo>
@@ -946,7 +957,7 @@ SIMemOpAccess::getAtomicFenceInfo(const MachineBasicBlock::iterator &MI) const {
   return SIMemOpInfo(ST, Ordering, Scope, OrderingAddrSpace,
                      SIAtomicAddrSpace::ATOMIC, IsCrossAddressSpaceOrdering,
                      AtomicOrdering::NotAtomic, false, false, false, false,
-                     CanDemoteWorkgroupToWavefront);
+                     CanDemoteWorkgroupToWavefront, hasAVNoneMMRA(*MI));
 }
 
 std::optional<SIMemOpInfo> SIMemOpAccess::getAtomicCmpxchgOrRmwInfo(
@@ -2317,9 +2328,10 @@ bool SIMemoryLegalizer::expandLoad(const SIMemOpInfo &MOI,
           CC->insertWait(MI, MOI.getScope(), MOI.getInstrAddrSpace(),
                          SIMemOp::LOAD, MOI.getIsCrossAddressSpaceOrdering(),
                          Position::AFTER, Order, /*AtomicsOnly=*/true);
-      Changed |= CC->insertAcquire(MI, MOI.getScope(),
-                                   MOI.getOrderingAddrSpace(),
-                                   Position::AFTER);
+      if (!MOI.isAVNone()) {
+        Changed |= CC->insertAcquire(
+            MI, MOI.getScope(), MOI.getOrderingAddrSpace(), Position::AFTER);
+      }
     }
 
     return Changed;
@@ -2363,11 +2375,12 @@ bool SIMemoryLegalizer::expandStore(const SIMemOpInfo &MOI,
       Changed |= CC->handleCooperativeAtomic(*MI);
 
     if (MOI.getOrdering() == AtomicOrdering::Release ||
-        MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent)
-      Changed |= CC->insertRelease(MI, MOI.getScope(),
-                                   MOI.getOrderingAddrSpace(),
-                                   MOI.getIsCrossAddressSpaceOrdering(),
-                                   Position::BEFORE);
+        MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent) {
+      Changed |=
+          CC->insertRelease(MI, MOI.getScope(), MOI.getOrderingAddrSpace(),
+                            MOI.getIsCrossAddressSpaceOrdering(),
+                            Position::BEFORE, MOI.isAVNone());
+    }
 
     Changed |= CC->finalizeStore(StoreMI, /*Atomic=*/true);
     return Changed;
@@ -2412,7 +2425,7 @@ bool SIMemoryLegalizer::expandAtomicFence(const SIMemOpInfo &MOI,
 
     if (Order == AtomicOrdering::Release ||
         Order == AtomicOrdering::AcquireRelease ||
-        Order == AtomicOrdering::SequentiallyConsistent)
+        Order == AtomicOrdering::SequentiallyConsistent) {
       /// TODO: This relies on a barrier always generating a waitcnt
       /// for LDS to ensure it is not reordered with the completion of
       /// the proceeding LDS operations. If barrier had a memory
@@ -2422,18 +2435,21 @@ bool SIMemoryLegalizer::expandAtomicFence(const SIMemOpInfo &MOI,
       /// adding S_WAITCNT before a S_BARRIER.
       Changed |= CC->insertRelease(MI, MOI.getScope(), OrderingAddrSpace,
                                    MOI.getIsCrossAddressSpaceOrdering(),
-                                   Position::BEFORE);
+                                   Position::BEFORE, MOI.isAVNone());
+    }
 
     // TODO: If both release and invalidate are happening they could be combined
     // to use the single "BUFFER_WBINV*" instruction. This could be done by
     // reorganizing this code or as part of optimizing SIInsertWaitcnt pass to
     // track cache invalidate and write back instructions.
 
-    if (Order == AtomicOrdering::Acquire ||
-        Order == AtomicOrdering::AcquireRelease ||
-        Order == AtomicOrdering::SequentiallyConsistent)
+    if ((Order == AtomicOrdering::Acquire ||
+         Order == AtomicOrdering::AcquireRelease ||
+         Order == AtomicOrdering::SequentiallyConsistent) &&
+        !MOI.isAVNone()) {
       Changed |= CC->insertAcquire(MI, MOI.getScope(), OrderingAddrSpace,
                                    Position::BEFORE);
+    }
 
     return Changed;
   }
@@ -2469,11 +2485,12 @@ bool SIMemoryLegalizer::expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
     if (Order == AtomicOrdering::Release ||
         Order == AtomicOrdering::AcquireRelease ||
         Order == AtomicOrdering::SequentiallyConsistent ||
-        MOI.getFailureOrdering() == AtomicOrdering::SequentiallyConsistent)
-      Changed |= CC->insertRelease(MI, MOI.getScope(),
-                                   MOI.getOrderingAddrSpace(),
-                                   MOI.getIsCrossAddressSpaceOrdering(),
-                                   Position::BEFORE);
+        MOI.getFailureOrdering() == AtomicOrdering::SequentiallyConsistent) {
+      Changed |=
+          CC->insertRelease(MI, MOI.getScope(), MOI.getOrderingAddrSpace(),
+                            MOI.getIsCrossAddressSpaceOrdering(),
+                            Position::BEFORE, MOI.isAVNone());
+    }
 
     if (Order == AtomicOrdering::Acquire ||
         Order == AtomicOrdering::AcquireRelease ||
@@ -2486,9 +2503,10 @@ bool SIMemoryLegalizer::expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
                          isAtomicRet(*MI) ? SIMemOp::LOAD : SIMemOp::STORE,
                          MOI.getIsCrossAddressSpaceOrdering(), Position::AFTER,
                          Order, /*AtomicsOnly=*/true);
-      Changed |= CC->insertAcquire(MI, MOI.getScope(),
-                                   MOI.getOrderingAddrSpace(),
-                                   Position::AFTER);
+      if (!MOI.isAVNone()) {
+        Changed |= CC->insertAcquire(
+            MI, MOI.getScope(), MOI.getOrderingAddrSpace(), Position::AFTER);
+      }
     }
 
     Changed |= CC->finalizeStore(RMWMI, /*Atomic=*/true);
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll
new file mode 100644
index 0000000000000..89230fe1b7cdd
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll
@@ -0,0 +1,722 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx90a < %s | FileCheck -check-prefixes=GFX90A %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck -check-prefixes=GFX90A-TGSPLIT %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1200 < %s | FileCheck --check-prefixes=GFX12-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1200 -mattr=+cumode < %s | FileCheck --check-prefixes=GFX12-CU %s
+
+; Test that !amdgcn-av-none suppresses MakeAvailable/MakeVisible (cache
+; writeback/invalidation) while preserving ordering (waits).
+
+; Fences: one per scope, varying orderings.
+
+define amdgpu_kernel void @workgroup_acq_rel_fence_av_none() {
+; GFX90A-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_endpgm
+entry:
+  fence syncscope("workgroup") acq_rel, !mmra !0
+  ret void
+}
+
+define amdgpu_kernel void @cluster_seq_cst_fence_av_none() {
+; GFX90A-LABEL: cluster_seq_cst_fence_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: cluster_seq_cst_fence_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: cluster_seq_cst_fence_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: cluster_seq_cst_fence_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_endpgm
+entry:
+  fence syncscope("cluster") seq_cst, !mmra !0
+  ret void
+}
+
+define amdgpu_kernel void @agent_acquire_fence_av_none() {
+; GFX90A-LABEL: agent_acquire_fence_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: agent_acquire_fence_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: agent_acquire_fence_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: agent_acquire_fence_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_endpgm
+entry:
+  fence syncscope("agent") acquire, !mmra !0
+  ret void
+}
+
+define amdgpu_kernel void @agent_release_fence_av_none() {
+; GFX90A-LABEL: agent_release_fence_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: agent_release_fence_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: agent_release_fence_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: agent_release_fence_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_endpgm
+entry:
+  fence syncscope("agent") release, !mmra !0
+  ret void
+}
+
+define amdgpu_kernel void @system_seq_cst_fence_av_none() {
+; GFX90A-LABEL: system_seq_cst_fence_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: system_seq_cst_fence_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: system_seq_cst_fence_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: system_seq_cst_fence_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_endpgm
+entry:
+  fence seq_cst, !mmra !0
+  ret void
+}
+
+; Atomic loads: acquire across scopes.
+
+define i32 @workgroup_acquire_load_av_none(ptr addrspace(1) %ptr) {
+; GFX90A-LABEL: workgroup_acquire_load_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    v_mov_b32_e32 v2, v1
+; GFX90A-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    global_load_dword v0, v[0:1], off
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX90A-TGSPLIT-LABEL: workgroup_acquire_load_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v2, v1
+; GFX90A-TGSPLIT-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-TGSPLIT-NEXT:    global_load_dword v0, v[0:1], off glc
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: workgroup_acquire_load_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_wait_expcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT:    v_mov_b32_e32 v2, v1
+; GFX12-WGP-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-WGP-NEXT:    v_mov_b32_e32 v1, v2
+; GFX12-WGP-NEXT:    global_load_b32 v0, v[0:1], off scope:SCOPE_SE
+; GFX12-WGP-NEXT:    s_wait_loadcnt 0x0
+; GFX12-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-CU-LABEL: workgroup_acquire_load_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_wait_expcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_kmcnt 0x0
+; GFX12-CU-NEXT:    v_mov_b32_e32 v2, v1
+; GFX12-CU-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-CU-NEXT:    v_mov_b32_e32 v1, v2
+; GFX12-CU-NEXT:    global_load_b32 v0, v[0:1], off
+; GFX12-CU-NEXT:    s_wait_loadcnt 0x0
+; GFX12-CU-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %val = load atomic i32, ptr addrspace(1) %ptr syncscope("workgroup") acquire, align 4, !mmra !0
+  ret i32 %val
+}
+
+define i32 @agent_acquire_load_av_none(ptr addrspace(1) %ptr) {
+; GFX90A-LABEL: agent_acquire_load_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    v_mov_b32_e32 v2, v1
+; GFX90A-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-NEXT:    global_load_dword v0, v[0:1], off glc
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX90A-TGSPLIT-LABEL: agent_acquire_load_av_none:
+; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v2, v1
+; GFX90A-TGSPLIT-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-TGSPLIT-NEXT:    v_mov_b32_e32 v1, v2
+; GFX90A-TGSPLIT-NEXT:    global_load_dword v0, v[0:1], off glc
+; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: agent_acquire_load_av_none:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_wait_expcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT:    v_mov_b32_e32 v2, v1
+; GFX12-WGP-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-WGP-NEXT:    v_mov_b32_e32 v1, v2
+; GFX12-WGP-NEXT:    global_load_b32 v0, v[0:1], off scope:SCOPE_DEV
+; GFX12-WGP-NEXT:    s_wait_loadcnt 0x0
+; GFX12-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-CU-LABEL: agent_acquire_load_av_none:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_wait_expcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_kmcnt 0x0
+; GFX12-CU-NEXT:    v_mov_b32_e32 v2, v1
+; GFX12-CU-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-CU-NEXT:    v_mov_b32_e32 v1, v2
+; GFX12-CU-NEXT:    global_load_b32 v0, v[0:1], off scope:SCOPE_DEV
+; GFX12-CU-NEXT:    s_wait_loadcnt 0x0
+; GFX12-CU-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %val = load atomic i32, ptr addrspace(1) %ptr syncscope("agent") acquire, align 4, !mmra !0
+  ret i32 %val
+}
+
+define i32 @system_acquire_load_av_none(ptr addrspace(1) %ptr) {
+; GFX90A-LABEL: system_acquire_load_av_none:
+; GFX90A:       ; %bb.0: ; %entry
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    v_mov_b32_e32 v2, v1
+; GFX90A-NEXT:    ; ...
[truncated]

auto MMRA = MMRAMetadata(MI.getMMRAMetadata());
if (!MMRA)
return false;
return MMRA.hasTag("amdgcn-av", "none");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the tag name be amdgpu-av, to be more consistent with the existing amdgpu-synchronize-as?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: should we diagnose values other than "none"? There is no case where we want to make use of the happens-before-breaking semantics of incompatible MMRAs, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the tag name be amdgpu-av, to be more consistent with the existing amdgpu-synchronize-as?

It's something I had explored. amdgpu-synchronize-as is one of the rare places where "amdgpu" is used, while in the case of almost all builtins, intrinsics and metadata, "amdgcn" is the convention. @Pierre-vh were you trying to start a newer convention with the amdgpu-synchronize-as?

Also: should we diagnose values other than "none"? There is no case where we want to make use of the happens-before-breaking semantics of incompatible MMRAs, right?

Yes, I will add the check.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: should we diagnose values other than "none"? There is no case where we want to make use of the happens-before-breaking semantics of incompatible MMRAs, right?

I remember I asked similar question somewhere else but can't find it. What should be the correct way of handling metadata verification? In the IR verifier or where they are being used?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was struggling with that when validating "!amdgcn-av !none". For now I just copied what is done for "!amdgpu-synchronize-as", which is to validate it at the time of consumption. It would be nice to separately work on an AMDGPU metadata verifier plugged into the IR verifier.

Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

🐧 Linux x64 Test Results

  • 195898 tests passed
  • 5274 tests skipped

✅ The build succeeded and all tests passed.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

🪟 Windows x64 Test Results

  • 135191 tests passed
  • 3333 tests skipped

✅ The build succeeded and all tests passed.

Base automatically changed from users/ssahasra/refactor-acq-rel to main May 27, 2026 14:38
ssahasra added a commit that referenced this pull request May 27, 2026
…C) (#199486)

A release consists of two actions: write-back the current cache, and
wait for "relevant" outstanding operations to complete. With the new
memory model, it is possible to disable the cache write-back using
"non-av". This patch cleanly separates the existing implementation so
that the write-backs can be selectively applied after checking for
non-av semantics.

Part of a stack:

- #199486
- #199621
- #199489 
- #199622

Assisted-By: Claude Opus 4.6

---------

Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
Copy link
Copy Markdown
Contributor

@RyanRio RyanRio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see any issues with this. You could also add a test that combines the synchronize-as metadata with the av metadata, for applicable areas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants