[AMDGPU] Implement "non-av" semantics using metadata#199489
A release consists of two actions: write back the current cache, and wait for "relevant" outstanding operations to complete. With the new memory model, the cache write-back can be disabled using "av none" semantics. This patch cleanly separates the existing implementation so that the write-backs can be applied selectively when such metadata is present.
Assisted-By: Claude Opus 4.6
When the MMRA tag !{!"amdgcn-av", !"none"} is present on a synchronization
operation (fence, atomic load/store/rmw/cmpxchg), suppress cache writeback
(MakeAvailable) and cache invalidation (MakeVisible) while preserving
memory ordering (waits).
This implements the metadata proposed in #191246.
Fixes: LCOMPILER-2214
Assisted-By: Claude Opus 4.6
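The intended behavior can be illustrated with a toy C++ model that is independent of LLVM. This is only a sketch under stated assumptions: `Ordering`, `expandSync`, and the action strings are hypothetical names made up for illustration, and the real pass emits target-specific instructions through `SICacheControl` rather than strings. The model assumes that a release part is "writeback, then wait" and an acquire part is "wait, then invalidate", and that the tag removes only the cache actions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these names are hypothetical and do not exist in LLVM.
enum class Ordering { Acquire, Release, AcquireRelease, SeqCst };

// Returns the abstract actions needed by a synchronization operation.
// With IsAVNone set, cache writeback (MakeAvailable) and cache invalidation
// (MakeVisible) are dropped, while the wait that provides ordering is kept.
std::vector<std::string> expandSync(Ordering O, bool IsAVNone) {
  std::vector<std::string> Actions;
  const bool HasRelease = O != Ordering::Acquire;
  const bool HasAcquire = O != Ordering::Release;

  if (HasRelease) {
    if (!IsAVNone)
      Actions.push_back("writeback"); // MakeAvailable
    Actions.push_back("wait");        // ordering is always preserved
  }
  if (HasAcquire) {
    if (!HasRelease)
      Actions.push_back("wait");       // acquire-only still needs to wait
    if (!IsAVNone)
      Actions.push_back("invalidate"); // MakeVisible
  }
  return Actions;
}
```

In this model, `expandSync(Ordering::AcquireRelease, false)` yields writeback, wait, invalidate, whereas passing `true` for `IsAVNone` leaves only the wait. That mirrors the shape of the test expectations in this patch, where the tagged fences lower to wait instructions only.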
@llvm/pr-subscribers-backend-amdgpu
Author: Sameer Sahasrabuddhe (ssahasra)
Changes
When the MMRA tag !{!"amdgcn-av", !"none"} is present on a synchronization operation (fence, atomic load/store/rmw/cmpxchg), suppress cache writeback (MakeAvailable) and cache invalidation (MakeVisible) while preserving memory ordering (waits).
This implements the metadata proposed in #191246.
Fixes: LCOMPILER-2214
Assisted-By: Claude Opus 4.6
Patch is 39.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199489.diff
2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index b721dcaf49d0f..f16192d343531 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -148,6 +148,7 @@ class SIMemOpInfo final {
bool IsNonTemporal = false;
bool IsLastUse = false;
bool IsCooperative = false;
+ bool IsAVNone = false;
// TODO: Should we assume Cooperative=true if no MMO is present?
SIMemOpInfo(
@@ -160,12 +161,12 @@ class SIMemOpInfo final {
AtomicOrdering FailureOrdering = AtomicOrdering::SequentiallyConsistent,
bool IsVolatile = false, bool IsNonTemporal = false,
bool IsLastUse = false, bool IsCooperative = false,
- bool CanDemoteWorkgroupToWavefront = false)
+ bool CanDemoteWorkgroupToWavefront = false, bool IsAVNone = false)
: Ordering(Ordering), FailureOrdering(FailureOrdering), Scope(Scope),
OrderingAddrSpace(OrderingAddrSpace), InstrAddrSpace(InstrAddrSpace),
IsCrossAddressSpaceOrdering(IsCrossAddressSpaceOrdering),
IsVolatile(IsVolatile), IsNonTemporal(IsNonTemporal),
- IsLastUse(IsLastUse), IsCooperative(IsCooperative) {
+ IsLastUse(IsLastUse), IsCooperative(IsCooperative), IsAVNone(IsAVNone) {
if (Ordering == AtomicOrdering::NotAtomic) {
assert(!IsCooperative && "Cannot be cooperative & non-atomic!");
@@ -277,6 +278,9 @@ class SIMemOpInfo final {
/// \returns True if this is a cooperative load or store atomic.
bool isCooperative() const { return IsCooperative; }
+ /// \returns True if MakeAvailable/MakeVisible should be suppressed.
+ bool isAVNone() const { return IsAVNone; }
+
/// \returns True if ordering constraint of the machine instruction used to
/// create this SIMemOpInfo is unordered or higher, false otherwise.
bool isAtomic() const {
@@ -451,13 +455,13 @@ class SICacheControl {
SIAtomicScope Scope, SIAtomicAddrSpace AddrSpace,
Position Pos) const = 0;
- /// Inserts writeback followed by an unconditional wait to implement a
- /// release operation.
+ /// Inserts writeback (unless \p IsAVNone) followed by an unconditional wait.
bool insertRelease(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace, bool IsCrossAddrSpaceOrdering,
- Position Pos) const {
+ Position Pos, bool IsAVNone) const {
bool Changed = false;
- Changed |= insertWriteback(MI, Scope, AddrSpace, Pos);
+ if (!IsAVNone)
+ Changed |= insertWriteback(MI, Scope, AddrSpace, Pos);
Changed |= insertWait(MI, Scope, AddrSpace, SIMemOp::LOAD | SIMemOp::STORE,
IsCrossAddrSpaceOrdering, Pos,
AtomicOrdering::Release, /*AtomicsOnly=*/false);
@@ -733,6 +737,13 @@ getSynchronizeAddrSpaceMD(const MachineInstr &MI) {
return Result;
}
+static bool hasAVNoneMMRA(const MachineInstr &MI) {
+ auto MMRA = MMRAMetadata(MI.getMMRAMetadata());
+ if (!MMRA)
+ return false;
+ return MMRA.hasTag("amdgcn-av", "none");
+}
+
} // end anonymous namespace
void SIMemOpAccess::reportUnsupported(const MachineBasicBlock::iterator &MI,
@@ -876,7 +887,7 @@ std::optional<SIMemOpInfo> SIMemOpAccess::constructFromMIWithMMO(
return SIMemOpInfo(ST, Ordering, Scope, OrderingAddrSpace, InstrAddrSpace,
IsCrossAddressSpaceOrdering, FailureOrdering, IsVolatile,
IsNonTemporal, IsLastUse, IsCooperative,
- CanDemoteWorkgroupToWavefront);
+ CanDemoteWorkgroupToWavefront, hasAVNoneMMRA(*MI));
}
std::optional<SIMemOpInfo>
@@ -946,7 +957,7 @@ SIMemOpAccess::getAtomicFenceInfo(const MachineBasicBlock::iterator &MI) const {
return SIMemOpInfo(ST, Ordering, Scope, OrderingAddrSpace,
SIAtomicAddrSpace::ATOMIC, IsCrossAddressSpaceOrdering,
AtomicOrdering::NotAtomic, false, false, false, false,
- CanDemoteWorkgroupToWavefront);
+ CanDemoteWorkgroupToWavefront, hasAVNoneMMRA(*MI));
}
std::optional<SIMemOpInfo> SIMemOpAccess::getAtomicCmpxchgOrRmwInfo(
@@ -2317,9 +2328,10 @@ bool SIMemoryLegalizer::expandLoad(const SIMemOpInfo &MOI,
CC->insertWait(MI, MOI.getScope(), MOI.getInstrAddrSpace(),
SIMemOp::LOAD, MOI.getIsCrossAddressSpaceOrdering(),
Position::AFTER, Order, /*AtomicsOnly=*/true);
- Changed |= CC->insertAcquire(MI, MOI.getScope(),
- MOI.getOrderingAddrSpace(),
- Position::AFTER);
+ if (!MOI.isAVNone()) {
+ Changed |= CC->insertAcquire(
+ MI, MOI.getScope(), MOI.getOrderingAddrSpace(), Position::AFTER);
+ }
}
return Changed;
@@ -2363,11 +2375,12 @@ bool SIMemoryLegalizer::expandStore(const SIMemOpInfo &MOI,
Changed |= CC->handleCooperativeAtomic(*MI);
if (MOI.getOrdering() == AtomicOrdering::Release ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent)
- Changed |= CC->insertRelease(MI, MOI.getScope(),
- MOI.getOrderingAddrSpace(),
- MOI.getIsCrossAddressSpaceOrdering(),
- Position::BEFORE);
+ MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent) {
+ Changed |=
+ CC->insertRelease(MI, MOI.getScope(), MOI.getOrderingAddrSpace(),
+ MOI.getIsCrossAddressSpaceOrdering(),
+ Position::BEFORE, MOI.isAVNone());
+ }
Changed |= CC->finalizeStore(StoreMI, /*Atomic=*/true);
return Changed;
@@ -2412,7 +2425,7 @@ bool SIMemoryLegalizer::expandAtomicFence(const SIMemOpInfo &MOI,
if (Order == AtomicOrdering::Release ||
Order == AtomicOrdering::AcquireRelease ||
- Order == AtomicOrdering::SequentiallyConsistent)
+ Order == AtomicOrdering::SequentiallyConsistent) {
/// TODO: This relies on a barrier always generating a waitcnt
/// for LDS to ensure it is not reordered with the completion of
/// the proceeding LDS operations. If barrier had a memory
@@ -2422,18 +2435,21 @@ bool SIMemoryLegalizer::expandAtomicFence(const SIMemOpInfo &MOI,
/// adding S_WAITCNT before a S_BARRIER.
Changed |= CC->insertRelease(MI, MOI.getScope(), OrderingAddrSpace,
MOI.getIsCrossAddressSpaceOrdering(),
- Position::BEFORE);
+ Position::BEFORE, MOI.isAVNone());
+ }
// TODO: If both release and invalidate are happening they could be combined
// to use the single "BUFFER_WBINV*" instruction. This could be done by
// reorganizing this code or as part of optimizing SIInsertWaitcnt pass to
// track cache invalidate and write back instructions.
- if (Order == AtomicOrdering::Acquire ||
- Order == AtomicOrdering::AcquireRelease ||
- Order == AtomicOrdering::SequentiallyConsistent)
+ if ((Order == AtomicOrdering::Acquire ||
+ Order == AtomicOrdering::AcquireRelease ||
+ Order == AtomicOrdering::SequentiallyConsistent) &&
+ !MOI.isAVNone()) {
Changed |= CC->insertAcquire(MI, MOI.getScope(), OrderingAddrSpace,
Position::BEFORE);
+ }
return Changed;
}
@@ -2469,11 +2485,12 @@ bool SIMemoryLegalizer::expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
if (Order == AtomicOrdering::Release ||
Order == AtomicOrdering::AcquireRelease ||
Order == AtomicOrdering::SequentiallyConsistent ||
- MOI.getFailureOrdering() == AtomicOrdering::SequentiallyConsistent)
- Changed |= CC->insertRelease(MI, MOI.getScope(),
- MOI.getOrderingAddrSpace(),
- MOI.getIsCrossAddressSpaceOrdering(),
- Position::BEFORE);
+ MOI.getFailureOrdering() == AtomicOrdering::SequentiallyConsistent) {
+ Changed |=
+ CC->insertRelease(MI, MOI.getScope(), MOI.getOrderingAddrSpace(),
+ MOI.getIsCrossAddressSpaceOrdering(),
+ Position::BEFORE, MOI.isAVNone());
+ }
if (Order == AtomicOrdering::Acquire ||
Order == AtomicOrdering::AcquireRelease ||
@@ -2486,9 +2503,10 @@ bool SIMemoryLegalizer::expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
isAtomicRet(*MI) ? SIMemOp::LOAD : SIMemOp::STORE,
MOI.getIsCrossAddressSpaceOrdering(), Position::AFTER,
Order, /*AtomicsOnly=*/true);
- Changed |= CC->insertAcquire(MI, MOI.getScope(),
- MOI.getOrderingAddrSpace(),
- Position::AFTER);
+ if (!MOI.isAVNone()) {
+ Changed |= CC->insertAcquire(
+ MI, MOI.getScope(), MOI.getOrderingAddrSpace(), Position::AFTER);
+ }
}
Changed |= CC->finalizeStore(RMWMI, /*Atomic=*/true);
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll
new file mode 100644
index 0000000000000..89230fe1b7cdd
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-av-none.ll
@@ -0,0 +1,722 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx90a < %s | FileCheck -check-prefixes=GFX90A %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx90a -mattr=+tgsplit < %s | FileCheck -check-prefixes=GFX90A-TGSPLIT %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1200 < %s | FileCheck --check-prefixes=GFX12-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1200 -mattr=+cumode < %s | FileCheck --check-prefixes=GFX12-CU %s
+
+; Test that !amdgcn-av-none suppresses MakeAvailable/MakeVisible (cache
+; writeback/invalidation) while preserving ordering (waits).
+
+; Fences: one per scope, varying orderings.
+
+define amdgpu_kernel void @workgroup_acq_rel_fence_av_none() {
+; GFX90A-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
+; GFX90A-NEXT: s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_endpgm
+;
+; GFX12-WGP-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT: s_wait_storecnt 0x0
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_endpgm
+;
+; GFX12-CU-LABEL: workgroup_acq_rel_fence_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT: s_wait_samplecnt 0x0
+; GFX12-CU-NEXT: s_wait_storecnt 0x0
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_endpgm
+entry:
+ fence syncscope("workgroup") acq_rel, !mmra !0
+ ret void
+}
+
+define amdgpu_kernel void @cluster_seq_cst_fence_av_none() {
+; GFX90A-LABEL: cluster_seq_cst_fence_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: cluster_seq_cst_fence_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_endpgm
+;
+; GFX12-WGP-LABEL: cluster_seq_cst_fence_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT: s_wait_storecnt 0x0
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_endpgm
+;
+; GFX12-CU-LABEL: cluster_seq_cst_fence_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT: s_wait_samplecnt 0x0
+; GFX12-CU-NEXT: s_wait_storecnt 0x0
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_endpgm
+entry:
+ fence syncscope("cluster") seq_cst, !mmra !0
+ ret void
+}
+
+define amdgpu_kernel void @agent_acquire_fence_av_none() {
+; GFX90A-LABEL: agent_acquire_fence_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: agent_acquire_fence_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_endpgm
+;
+; GFX12-WGP-LABEL: agent_acquire_fence_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_storecnt 0x0
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_endpgm
+;
+; GFX12-CU-LABEL: agent_acquire_fence_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_storecnt 0x0
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_endpgm
+entry:
+ fence syncscope("agent") acquire, !mmra !0
+ ret void
+}
+
+define amdgpu_kernel void @agent_release_fence_av_none() {
+; GFX90A-LABEL: agent_release_fence_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: agent_release_fence_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_endpgm
+;
+; GFX12-WGP-LABEL: agent_release_fence_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT: s_wait_storecnt 0x0
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_endpgm
+;
+; GFX12-CU-LABEL: agent_release_fence_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT: s_wait_samplecnt 0x0
+; GFX12-CU-NEXT: s_wait_storecnt 0x0
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_endpgm
+entry:
+ fence syncscope("agent") release, !mmra !0
+ ret void
+}
+
+define amdgpu_kernel void @system_seq_cst_fence_av_none() {
+; GFX90A-LABEL: system_seq_cst_fence_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: s_endpgm
+;
+; GFX90A-TGSPLIT-LABEL: system_seq_cst_fence_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_endpgm
+;
+; GFX12-WGP-LABEL: system_seq_cst_fence_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT: s_wait_storecnt 0x0
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_endpgm
+;
+; GFX12-CU-LABEL: system_seq_cst_fence_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT: s_wait_samplecnt 0x0
+; GFX12-CU-NEXT: s_wait_storecnt 0x0
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_endpgm
+entry:
+ fence seq_cst, !mmra !0
+ ret void
+}
+
+; Atomic loads: acquire across scopes.
+
+define i32 @workgroup_acquire_load_av_none(ptr addrspace(1) %ptr) {
+; GFX90A-LABEL: workgroup_acquire_load_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: v_mov_b32_e32 v2, v1
+; GFX90A-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-NEXT: v_mov_b32_e32 v1, v2
+; GFX90A-NEXT: global_load_dword v0, v[0:1], off
+; GFX90A-NEXT: s_waitcnt vmcnt(0)
+; GFX90A-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX90A-TGSPLIT-LABEL: workgroup_acquire_load_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: v_mov_b32_e32 v2, v1
+; GFX90A-TGSPLIT-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-TGSPLIT-NEXT: v_mov_b32_e32 v1, v2
+; GFX90A-TGSPLIT-NEXT: global_load_dword v0, v[0:1], off glc
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: workgroup_acquire_load_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_wait_expcnt 0x0
+; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v1
+; GFX12-WGP-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-WGP-NEXT: v_mov_b32_e32 v1, v2
+; GFX12-WGP-NEXT: global_load_b32 v0, v[0:1], off scope:SCOPE_SE
+; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
+; GFX12-WGP-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX12-CU-LABEL: workgroup_acquire_load_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_wait_expcnt 0x0
+; GFX12-CU-NEXT: s_wait_samplecnt 0x0
+; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT: s_wait_kmcnt 0x0
+; GFX12-CU-NEXT: v_mov_b32_e32 v2, v1
+; GFX12-CU-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-CU-NEXT: v_mov_b32_e32 v1, v2
+; GFX12-CU-NEXT: global_load_b32 v0, v[0:1], off
+; GFX12-CU-NEXT: s_wait_loadcnt 0x0
+; GFX12-CU-NEXT: s_setpc_b64 s[30:31]
+entry:
+ %val = load atomic i32, ptr addrspace(1) %ptr syncscope("workgroup") acquire, align 4, !mmra !0
+ ret i32 %val
+}
+
+define i32 @agent_acquire_load_av_none(ptr addrspace(1) %ptr) {
+; GFX90A-LABEL: agent_acquire_load_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: v_mov_b32_e32 v2, v1
+; GFX90A-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-NEXT: v_mov_b32_e32 v1, v2
+; GFX90A-NEXT: global_load_dword v0, v[0:1], off glc
+; GFX90A-NEXT: s_waitcnt vmcnt(0)
+; GFX90A-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX90A-TGSPLIT-LABEL: agent_acquire_load_av_none:
+; GFX90A-TGSPLIT: ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT: v_mov_b32_e32 v2, v1
+; GFX90A-TGSPLIT-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX90A-TGSPLIT-NEXT: v_mov_b32_e32 v1, v2
+; GFX90A-TGSPLIT-NEXT: global_load_dword v0, v[0:1], off glc
+; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: agent_acquire_load_av_none:
+; GFX12-WGP: ; %bb.0: ; %entry
+; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT: s_wait_expcnt 0x0
+; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v1
+; GFX12-WGP-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-WGP-NEXT: v_mov_b32_e32 v1, v2
+; GFX12-WGP-NEXT: global_load_b32 v0, v[0:1], off scope:SCOPE_DEV
+; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
+; GFX12-WGP-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX12-CU-LABEL: agent_acquire_load_av_none:
+; GFX12-CU: ; %bb.0: ; %entry
+; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT: s_wait_expcnt 0x0
+; GFX12-CU-NEXT: s_wait_samplecnt 0x0
+; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT: s_wait_kmcnt 0x0
+; GFX12-CU-NEXT: v_mov_b32_e32 v2, v1
+; GFX12-CU-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
+; GFX12-CU-NEXT: v_mov_b32_e32 v1, v2
+; GFX12-CU-NEXT: global_load_b32 v0, v[0:1], off scope:SCOPE_DEV
+; GFX12-CU-NEXT: s_wait_loadcnt 0x0
+; GFX12-CU-NEXT: s_setpc_b64 s[30:31]
+entry:
+ %val = load atomic i32, ptr addrspace(1) %ptr syncscope("agent") acquire, align 4, !mmra !0
+ ret i32 %val
+}
+
+define i32 @system_acquire_load_av_none(ptr addrspace(1) %ptr) {
+; GFX90A-LABEL: system_acquire_load_av_none:
+; GFX90A: ; %bb.0: ; %entry
+; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: v_mov_b32_e32 v2, v1
+; GFX90A-NEXT: ; ...
[truncated]
auto MMRA = MMRAMetadata(MI.getMMRAMetadata());
if (!MMRA)
  return false;
return MMRA.hasTag("amdgcn-av", "none");
Should the tag name be amdgpu-av, to be more consistent with the existing amdgpu-synchronize-as?
Also: should we diagnose values other than "none"? There is no case where we want to make use of the happens-before-breaking semantics of incompatible MMRAs, right?
> Should the tag name be amdgpu-av, to be more consistent with the existing amdgpu-synchronize-as?
It's something I had explored. amdgpu-synchronize-as is one of the rare places where "amdgpu" is used, while in the case of almost all builtins, intrinsics and metadata, "amdgcn" is the convention. @Pierre-vh were you trying to start a newer convention with the amdgpu-synchronize-as?
> Also: should we diagnose values other than "none"? There is no case where we want to make use of the happens-before-breaking semantics of incompatible MMRAs, right?
Yes, I will add the check.
> Also: should we diagnose values other than "none"? There is no case where we want to make use of the happens-before-breaking semantics of incompatible MMRAs, right?
I remember asking a similar question somewhere else but can't find it. What is the correct way to handle metadata verification? In the IR verifier, or where the metadata is consumed?
Yeah, I was struggling with that when validating "!amdgcn-av !none". For now I just copied what is done for "!amdgpu-synchronize-as", which is to validate it at the time of consumption. It would be nice to separately work on an AMDGPU metadata verifier plugged into the IR verifier.
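A rough sketch of what "validate at the time of consumption" could look like, written as standalone C++17 rather than against the LLVM API. Everything here is an assumption for illustration: `TagSet`, `AVResult`, and `classifyAVTags` are hypothetical names, and the tags are modeled as plain (prefix, suffix) string pairs. The actual patch queries `MMRAMetadata::hasTag`, and how it reports an unsupported value is not shown in the truncated diff.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the set of MMRA (prefix, suffix) tags on one
// instruction. This is not the LLVM MMRAMetadata class.
using TagSet = std::vector<std::pair<std::string, std::string>>;

struct AVResult {
  bool IsAVNone = false;                // ("amdgcn-av", "none") was present
  std::vector<std::string> Unsupported; // other "amdgcn-av" values to diagnose
};

// Scans the tags once. Only the "none" value is accepted for the
// "amdgcn-av" prefix; any other value is collected so the caller can
// report it instead of silently ignoring it.
AVResult classifyAVTags(const TagSet &Tags) {
  AVResult R;
  for (const auto &[Prefix, Suffix] : Tags) {
    if (Prefix != "amdgcn-av")
      continue; // unrelated tags are left to their own consumers
    if (Suffix == "none")
      R.IsAVNone = true;
    else
      R.Unsupported.push_back(Suffix);
  }
  return R;
}
```

A caller would emit one diagnostic per entry in `Unsupported` and use `IsAVNone` to decide whether to skip the cache actions. Doing the check at the point of use keeps it close to the code that gives the tag meaning, at the cost of not catching bad values until code generation.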
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
…l' into users/ssahasra/av-metadata
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed.
…C) (#199486)
A release consists of two actions: write back the current cache, and wait for "relevant" outstanding operations to complete. With the new memory model, the cache write-back can be disabled using "non-av". This patch cleanly separates the existing implementation so that the write-backs can be applied selectively after checking for non-av semantics.
Part of a stack:
- #199486
- #199621
- #199489
- #199622
Assisted-By: Claude Opus 4.6
---------
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
RyanRio left a comment
Don't see any issues with this. You could also add a test that combines the synchronize-as metadata with the av metadata, for applicable areas.
…s/ssahasra/av-metadata
Part of a stack:
- amdgcn_av("none") attribute for atomic expressions #199622
Fixes: LCOMPILER-2214
Assisted-By: Claude Opus 4.6