[X86][CostModel] Add per-shape gather/scatter cost tables for AMD znver4+ #199488
amd-subharad wants to merge 1 commit
…er4+

The X86 cost model currently returns a single flat overhead from getGatherOverhead / getScatterOverhead, applied to every shape of masked gather or scatter on every X86 subtarget that reaches the gather/scatter path. On modern AMD parts the actual cost of these instructions varies substantially with the vector width and element size, and the single flat number forces the LoopVectorizer to either under- or over-estimate the profitability of vectorising loops that need indirect memory access.

This change adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked gather and never reach the new code, so the bit is intentionally NOT placed in ZNTuning; flagging older Zen parts with the feature would be misleading.

When the tuning bit is set, getGatherOverhead / getScatterOverhead look the source vector type up in per-shape cost tables before falling back to the existing generic flat overhead. The tables cover the gather shapes for VF=2..16 over i32 / f32 / f64, and the AVX-512 scatter shapes for VF=4..16 over the same element types. i64 entries are intentionally absent because the experimental sweep for those shapes landed inside the noise band of the generic flat overhead.

The numbers in both tables are the empirical break-even gather / scatter costs measured on znver4 / znver5 hardware. The methodology, summarised: take a controlled gather micro-benchmark, sweep the cost of the gather lowering using a forced-cost knob, and pick the cost at which the LoopVectorizer still selects the gather lowering over the scalar fallback (the largest cost at which vectorisation is profitable for that shape). The sweep is run independently for each (element type, VF) combination; the value tabulated is the break-even cost at which gather emission was the right call on Zen hardware. The scatter table is derived with the analogous sweep using a scatter micro-benchmark.

A new test, llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll, covers every shape in both tables on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead), so future changes that touch the cost model can quickly see whether they were intentional on AMD parts.

Co-authored-by: Cursor <cursoragent@cursor.com>
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-analysis
Author: Sumukh J Bharadwaj (amd-subharad)
Changes
Patch is 44.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199488.diff 5 Files Affected:
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index fffd696e59baf..df5e17a27d26c 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -220,6 +220,11 @@ Makes programs 10x faster by doing Special New Thing.
* `.att_syntax` directive is now emitted for assembly files when AT&T syntax is
in use. This matches the behaviour of Intel syntax and aids with
compatibility when changing the default Clang syntax to the Intel syntax.
+* Masked gather and scatter cost overheads are now per-shape on AMD znver4
+ and znver5 targets via a new `TuningPreferAMDZenGSCost` subtarget
+ feature, replacing the single flat overhead inherited from the generic
+ AVX-512 path. The per-shape costs use empirical break-even values
+ measured on Zen 4 / Zen 5 hardware.
### Changes to the OCaml bindings
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 50fb7204ebfa1..28bbd639649bb 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -721,6 +721,17 @@ def TuningFastGather
: SubtargetFeature<"fast-gather", "HasFastGather", "true",
"Indicates if gather is reasonably fast (this is true for Skylake client and all AVX-512 CPUs)">;
+// Use AMD Zen-tuned cost tables for masked gather/scatter intrinsics in the
+// X86 TargetTransformInfo cost model. Refines the flat overhead used by other
+// AVX-512 targets with per-element-type/per-VL costs measured on znver4 and
+// znver5. Inherited automatically by every znver4+ CPU via ZN4Tuning; not
+// applied to pre-AVX-512 Zen parts (znver1..3), which take the scalarise
+// path for masked gather anyway.
+def TuningPreferAMDZenGSCost
+ : SubtargetFeature<"prefer-amd-zen-gs-cost",
+ "HasPreferAMDZenGSCost", "true",
+ "Use AMD Zen-tuned gather/scatter cost tables in the cost model">;
+
// Generate vpdpwssd instead of vpmaddwd+vpaddd sequence.
def TuningFastDPWSSD
: SubtargetFeature<
@@ -1631,7 +1642,8 @@ def ProcessorFeatures {
list<SubtargetFeature> ZN3Features =
!listconcat(ZN2Features, ZN3AdditionalFeatures);
- list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD];
+ list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD,
+ TuningPreferAMDZenGSCost];
list<SubtargetFeature> ZN4Tuning =
!listconcat(ZN3Tuning, ZN4AdditionalTuning);
list<SubtargetFeature> ZN4AdditionalFeatures = [FeatureAVX512,
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 698be1615a04b..edc8e78c7f040 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6253,20 +6253,79 @@ InstructionCost X86TTIImpl::getCFInstrCost(unsigned Opcode,
return TTI::TCC_Free;
}
-int X86TTIImpl::getGatherOverhead() const {
+int X86TTIImpl::getGatherOverhead(Type *SrcVTy) const {
// Some CPUs have more overhead for gather. The specified overhead is relative
// to the Load operation. "2" is the number provided by Intel architects. This
// parameter is used for cost estimation of Gather Op and comparison with
// other alternatives.
// TODO: Remove the explicit hasAVX512()?, That would mean we would only
// enable gather with a -march.
+
+ // AMD znver4+ targets enable per-shape costs measured on the hardware via
+ // TuningPreferAMDZenGSCost (set in ZN4Tuning). Pre-AVX-512 Zen parts
+ // (znver1..3) take the scalarise path for masked gather and never reach
+ // this code, so the table only needs to cover AVX-512 widths.
+ if (ST->hasPreferAMDZenGSCost() && SrcVTy) {
+ // Per-shape gather costs for AMD znver4+ targets.
+ //
+ // The numbers are the empirical "break-even" (lower-bound) costs
+ // measured by sweeping a forced gather cost while compiling a
+ // controlled gather micro-benchmark and observing the point at which
+ // the LoopVectorizer still chose the gather lowering over the scalar
+ // fallback. The sweep was run independently for every (data type,
+ // VF) combination on Genoa / Milan / Turin and re-validated on Zen 5;
+ // the value tabulated below is the cost at which gather emission
+ // was the right call for that shape.
+ //
+ // i64 entries are intentionally absent: the i64 sweep landed within
+ // the noise of the generic flat overhead, so those shapes fall
+ // through to the existing flat cost.
+ static const CostTblEntry ZenGatherCostTable[] = {
+ {ISD::LOAD, MVT::v2i32, 20}, {ISD::LOAD, MVT::v4i32, 7},
+ {ISD::LOAD, MVT::v8i32, 17}, {ISD::LOAD, MVT::v16i32, 14},
+ {ISD::LOAD, MVT::v2f32, 20}, {ISD::LOAD, MVT::v4f32, 7},
+ {ISD::LOAD, MVT::v8f32, 17}, {ISD::LOAD, MVT::v16f32, 14},
+ {ISD::LOAD, MVT::v2f64, 20}, {ISD::LOAD, MVT::v4f64, 7},
+ {ISD::LOAD, MVT::v8f64, 17}, {ISD::LOAD, MVT::v16f64, 14},
+ };
+ EVT VT = TLI->getValueType(DL, SrcVTy);
+ if (VT.isSimple())
+ if (const auto *E = CostTableLookup(ZenGatherCostTable, ISD::LOAD,
+ VT.getSimpleVT()))
+ return E->Cost;
+ }
+
if (ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather()))
return 2;
return 1024;
}
-int X86TTIImpl::getScatterOverhead() const {
+int X86TTIImpl::getScatterOverhead(Type *SrcVTy) const {
+ // AMD znver4+ targets use per-shape scatter costs measured on the hardware
+ // via TuningPreferAMDZenGSCost (set in ZN4Tuning). Fall through to the
+ // generic flat overhead for shapes we have not characterised.
+ if (ST->hasPreferAMDZenGSCost() && ST->hasAVX512() && SrcVTy) {
+ // Per-shape scatter costs for AMD znver4+ targets, measured with the
+ // same break-even methodology as the gather table above. i32 / f32
+ // and f64 lanes use independent curves because their sweep results
+ // diverged on Zen hardware. i64 entries and VF=2 entries are
+ // intentionally absent and fall through to the generic flat overhead.
+ static const CostTblEntry ZenScatterCostTable[] = {
+ {ISD::STORE, MVT::v4i32, 12}, {ISD::STORE, MVT::v8i32, 14},
+ {ISD::STORE, MVT::v16i32, 6},
+ {ISD::STORE, MVT::v4f32, 12}, {ISD::STORE, MVT::v8f32, 14},
+ {ISD::STORE, MVT::v16f32, 16},
+ {ISD::STORE, MVT::v4f64, 5}, {ISD::STORE, MVT::v8f64, 15},
+ {ISD::STORE, MVT::v16f64, 3},
+ };
+ EVT VT = TLI->getValueType(DL, SrcVTy);
+ if (VT.isSimple())
+ if (const auto *E = CostTableLookup(ZenScatterCostTable, ISD::STORE,
+ VT.getSimpleVT()))
+ return E->Cost;
+ }
+
if (ST->hasAVX512())
return 2;
@@ -6338,8 +6397,9 @@ InstructionCost X86TTIImpl::getGSVectorCost(unsigned Opcode,
// The gather / scatter cost is given by Intel architects. It is a rough
// number since we are looking at one instruction in a time.
- const int GSOverhead = (Opcode == Instruction::Load) ? getGatherOverhead()
- : getScatterOverhead();
+ const int GSOverhead = (Opcode == Instruction::Load)
+ ? getGatherOverhead(SrcVTy)
+ : getScatterOverhead(SrcVTy);
return GSOverhead + VF * getMemoryOpCost(Opcode, SrcVTy->getScalarType(),
Alignment, AddressSpace, CostKind);
}
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index ea277bfeab560..ceb6dcc172f94 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -346,8 +346,8 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
Type *DataTy, const Value *Ptr,
Align Alignment, unsigned AddressSpace) const;
- int getGatherOverhead() const;
- int getScatterOverhead() const;
+ int getGatherOverhead(Type *SrcVTy) const;
+ int getScatterOverhead(Type *SrcVTy) const;
/// @}
};
diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
new file mode 100644
index 0000000000000..1565568d8d010
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
@@ -0,0 +1,553 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
+; Cost-model coverage for AMD Zen-tuned masked gather/scatter overheads.
+;
+; ZNVER4 / ZNVER5 enable the per-shape Zen cost tables via
+; TuningPreferAMDZenGSCost (set in ZN4Tuning and inherited by ZN5Tuning) and
+; have AVX-512, so the new tables are consulted in getGSVectorCost.
+; ZNVER3 does NOT carry TuningPreferAMDZenGSCost and lacks both AVX-512 and
+; TuningFastGather, so isLegalMaskedGather() returns false and the cost model
+; walks the scalarise path (getGSScalarCost). The ZNVER3 numbers below are the
+; unchanged scalar fallback cost, included here only to lock in that this
+; change does not regress pre-AVX-512 Zen targets.
+; SKX is a non-Zen AVX-512 baseline showing the generic flat overhead of 2.
+;
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver4 | FileCheck %s --check-prefix=ZNVER4
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver5 | FileCheck %s --check-prefix=ZNVER5
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver3 | FileCheck %s --check-prefix=ZNVER3
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=skx | FileCheck %s --check-prefix=SKX
+
+;------------------------------------------------------------------------------
+; Masked gather - i32 element type
+;------------------------------------------------------------------------------
+
+define <2 x i32> @gather_v2i32(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v2i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v2i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; SKX-LABEL: 'gather_v2i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+ %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> %ptrs, i32 4, <2 x i1> %mask, <2 x i32> undef)
+ ret <2 x i32> %v
+}
+
+define <4 x i32> @gather_v4i32(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v4i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v4i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; SKX-LABEL: 'gather_v4i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+ %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> %mask, <4 x i32> undef)
+ ret <4 x i32> %v
+}
+
+define <8 x i32> @gather_v8i32(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v8i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v8i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; SKX-LABEL: 'gather_v8i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+ %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %mask, <8 x i32> undef)
+ ret <8 x i32> %v
+}
+
+define <16 x i32> @gather_v16i32(<16 x ptr> %ptrs, <16 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v16i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v16i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v16i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 55 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; SKX-LABEL: 'gather_v16i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+ %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> %ptrs, i32 4, <16 x i1> %mask, <16 x i32> undef)
+ ret <16 x i32> %v
+}
+
+;------------------------------------------------------------------------------
+; Masked gather - i64 element type
+;------------------------------------------------------------------------------
+
+define <2 x i64> @gather_v2i64(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i64'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v2i64'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v2i64'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; SKX-LABEL: 'gather_v2i64'
+; SKX-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+ %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> %ptrs, i32 8, <2 x i1> %mask, <2 x i64> undef)
+ ret <2 x i64> %v
+}
+
+define <4 x i64> @gather_v4i64(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i64'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v4i64'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v4i64'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; SKX-LABEL: 'gather_v4i64'
+; SKX-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+ %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> %ptrs, i32 8, <4 x i1> %mask, <4 x i64> undef)
+ ret <4 x i64> %v
+}
+
+define <8 x i64> @gather_v8i64(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i64'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v8i64'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v8i64'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 29 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; SKX-LABEL: 'gather_v8i64'
+; SKX-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+ %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> %ptrs, i32 8, <8 x i1> %mask, <8 x i64> undef)
+ ret <8 x i64> %v
+}
+
+;---------...
[truncated]
AI tool use
This PR was developed with assistance from the Cursor IDE coding agent
(Anthropic Claude model). The agent helped scaffold the cost-table plumbing
in X86TargetTransformInfo.cpp, the TuningPreferAMDZenGSCost subtarget
feature wiring, and the regression test in
masked-gather-scatter-amd-zen.ll. The numeric cost values and the
break-even methodology described below are mine; every generated change
was inspected, built, and benchmark-validated before submission.
Summary
The X86 cost model currently returns a single flat overhead from
getGatherOverhead / getScatterOverhead, applied to every shape of
masked gather/scatter on every X86 subtarget that reaches the
gather/scatter path. On modern AMD parts the actual cost of these
instructions varies substantially with the vector width and element size,
and the single flat number forces the LoopVectorizer to either under- or
over-estimate the profitability of vectorising loops that need indirect
memory access.
This PR adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached
to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512
Zen parts (znver1..3) take the scalarise path for masked gather and never
reach the new code, so the bit is intentionally not placed in ZNTuning;
flagging older Zen parts with the feature would be misleading.
When the tuning bit is set, getGatherOverhead / getScatterOverhead
look the source vector type up in per-shape cost tables before falling
back to the existing generic flat overhead.
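The lookup-then-fallback order described above can be sketched outside of LLVM as follows. This is only an illustration: `ShapeCost`, `lookupZenGather`, and `gatherOverhead` are hypothetical names, shapes are keyed by plain strings instead of `MVT` values, and only the i32 gather rows from the patch's `ZenGatherCostTable` are copied in. The fallback value of 2 is the generic flat overhead the existing code returns on AVX-512 targets.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Simplified stand-in for an LLVM cost-table entry (hypothetical type,
// not the real CostTblEntry).
struct ShapeCost {
  std::string_view Shape; // e.g. "v4i32"
  int Cost;               // gather overhead for that shape
};

// i32 gather rows copied from ZenGatherCostTable in this patch.
constexpr ShapeCost ZenGatherCosts[] = {
    {"v2i32", 20}, {"v4i32", 7}, {"v8i32", 17}, {"v16i32", 14}};

// Returns the tabulated overhead, or nullopt for shapes that are not listed.
inline std::optional<int> lookupZenGather(std::string_view Shape) {
  for (const ShapeCost &E : ZenGatherCosts)
    if (E.Shape == Shape)
      return E.Cost;
  return std::nullopt;
}

// Table first when the tuning bit is set, otherwise (or on a miss) the
// generic flat overhead. Assumes an AVX-512 target, where that overhead is 2.
inline int gatherOverhead(bool HasZenGSCost, std::string_view Shape) {
  if (HasZenGSCost) {
    if (std::optional<int> C = lookupZenGather(Shape))
      return *C;
  }
  return 2;
}
```

With these definitions, `gatherOverhead(true, "v4i32")` takes the table entry, while an untabulated shape such as `"v4i64"` or a target without the tuning bit falls through to the flat value.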
Cost tables
Gather (VF=2..16 over i32 / f32 / f64; i64 falls through to the flat
default):
Scatter (VF=4..16 over i32 / f32 / f64; i64 and VF=2 fall through):
These are the cost overhead values published in AOCC for the same
hardware family (CPUPC-15189), used here unchanged so that community LLVM
matches the AOCC behaviour on Zen 4 / Zen 5 for indirect-memory
vectorisation decisions.
Methodology
The numbers in both tables are empirical break-even costs measured on
znver4 / znver5 hardware:
… indirect memory access per inner-loop iteration and an outer loop
chosen so total runtime is in the 60–120 second range for stable
timing.
… -force-gather-overhead-cost=N knob.
… N at which the LoopVectorizer still selects the gather lowering over
the scalar fallback. That N is the break-even cost for that (element
type, VF) combination on Zen.
… scatter using the analogous setup.
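The search for the break-even N can be sketched as below. The predicate is hypothetical: in the real experiment it would be answered by recompiling the micro-benchmark with the overhead forced to N and checking which lowering the LoopVectorizer chose. The sketch also assumes the decision is monotone in N (once gather is rejected at some cost, it is rejected at every higher cost), which is what makes a binary search valid; the description itself only says the cost was swept, so the search strategy here is this sketch's choice.

```cpp
#include <cassert>
#include <functional>

// Hypothetical predicate: true when the vectorizer still emits the gather
// lowering with the gather overhead forced to N.
using PicksGather = std::function<bool(int)>;

// Returns the largest N in [Lo, Hi] for which PicksGatherAt(N) is true,
// or Lo - 1 if gather is never selected in that range.
// Assumes PicksGatherAt is monotone: true up to some N, false afterwards.
inline int breakEvenCost(int Lo, int Hi, const PicksGather &PicksGatherAt) {
  int Best = Lo - 1;
  while (Lo <= Hi) {
    int Mid = Lo + (Hi - Lo) / 2;
    if (PicksGatherAt(Mid)) {
      Best = Mid;   // gather still chosen; try a higher forced cost
      Lo = Mid + 1;
    } else {
      Hi = Mid - 1; // gather rejected; the break-even is lower
    }
  }
  return Best;
}
```

A plain linear sweep over N gives the same answer and does not rely on the monotonicity assumption, at the price of one rebuild per candidate cost.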
The scatter table is keyed independently for 32-bit (i32 / f32) and
64-bit (f64) lanes because the sweep results diverged on Zen hardware:
64-bit scatter break-even is consistently lower than 32-bit scatter
break-even at the same VF.
Test
llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
covers every shape in both tables on znver4 / znver5 and pins the
unchanged behaviour for znver3 (scalarise path) and skx (generic flat
overhead).
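To relate the table entries to the numbers the test pins, the return expression of getGSVectorCost in the diff (`GSOverhead + VF * getMemoryOpCost(...)`) can be evaluated by hand. The sketch below assumes a per-element memory-op cost of 1, which is not stated in the patch but reproduces the visible v4i32 and v8i32 gather checks; `gsVectorCost` is a hypothetical helper name.

```cpp
#include <cassert>

// Mirrors the final return of getGSVectorCost: overhead plus VF times the
// scalar memory-op cost. ScalarMemCost = 1 is an assumption of this sketch.
constexpr int gsVectorCost(int Overhead, int VF, int ScalarMemCost = 1) {
  return Overhead + VF * ScalarMemCost;
}

// znver4 / znver5: per-shape overheads from ZenGatherCostTable.
static_assert(gsVectorCost(7, 4) == 11, "v4i32 gather, ZNVER4 check line");
static_assert(gsVectorCost(17, 8) == 25, "v8i32 gather, ZNVER4 check line");

// skx: generic flat overhead of 2.
static_assert(gsVectorCost(2, 4) == 6, "v4i32 gather, SKX check line");
static_assert(gsVectorCost(2, 8) == 10, "v8i32 gather, SKX check line");
```

The v2 and v16 checks do not follow from this expression alone; they presumably pass through earlier steps of getGSVectorCost that the truncated diff does not show.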
Benchmark validation
Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent
of this commit vs. the commit. Single-copy refrate, 3 iterations,
median-selected, -O3 -march=znver5 -ffast-math -flto.
519.lbm_r (SPEC CPU2017)
782.lbm_r (SPEC CPU2026)
These are the headline movers; both benchmarks are dominated by an inner
loop that performs strided / gather-style memory access (the D3Q19
lattice neighbour update in lbm) that the LoopVectorizer now correctly
prices as profitable on Zen.
A wider suite sweep (full intrate + fprate on both SPEC CPU2017 and
SPEC CPU2026) is in progress and the geomean delta will be added to this
description as soon as it lands; the early indication from the lbm
movers and from the targeted shape sweep is that no other benchmark
shows a meaningful regression versus the upstream flat-overhead
baseline.
Non-goals
… ZN4Tuning, every other subtarget hits the original code path with
byte-identical behaviour.
… changes the cost estimate the vectorizer sees; the actual lowering
path is unchanged.
Test plan
llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (1/1 PASS)
ninja check-llvm-analysis (PASS)
ninja check-llvm-codegen-x86 (5461/5479 PASS; the single FAIL is the pre-existing, unrelated CodeGen/X86/gc-empty-basic-blocks.ll new-pm CHECK mismatch introduced by ba3d7a5, not touched by this PR)
ninja check-llvm-transforms-loopvectorize-x86 (287/287 PASS)