Remove unsafe code from System.Linq.MaxMin#127845
Remove unsafe code from System.Linq.MaxMin#127845EgorBo wants to merge 2 commits intodotnet:mainfrom
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
15bf50d to
8837a91
Compare
This comment was marked as resolved.
This comment was marked as resolved.
4453202 to
9a4b7b1
Compare
647f299 to
4eeec93
Compare
This comment was marked as resolved.
This comment was marked as resolved.
|
@EgorBot -arm -amd using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
private int[] _ints = default!;
private long[] _longs = default!;
private byte[] _bytes = default!;
[Params(2, 4, 16, 32, 200, 1000)]
public int Length { get; set; }
[GlobalSetup]
public void Setup()
{
var rng = new Random(42);
_ints = new int[Length];
_longs = new long[Length];
_bytes = new byte[Length];
for (int i = 0; i < Length; i++)
{
_ints[i] = rng.Next();
_longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
_bytes[i] = (byte)rng.Next(256);
}
}
[Benchmark] public int IntMax() => _ints.Max();
[Benchmark] public int IntMin() => _ints.Min();
[Benchmark] public long LongMax() => _longs.Max();
[Benchmark] public byte ByteMax() => _bytes.Max();
} |
|
PTAL @MihaZupan @tannergooding Had to mark the method NoInlining due to inliner issues I want to address separately. Currently it adds 0.5ns call overhead for small inputs (but only for trivial benchmark wrappers where the entire thing used to inline, in the real world it's very unlikely to happen) Otherwise, no extra bound checks in this impl in codegen. benchmark: #127845 (comment) and results: EgorBot/Benchmarks#186 Follow ups:
|
MihaZupan
left a comment
There was a problem hiding this comment.
GitHub reviews are broken atm, so here's the comment I wanted to post:
Does this do anything since it's on the interface?
In this method we currently still run out of time budget, [AI] gave inliner a hint that we should inline this 14b method even we when already ran out of time budget. Otherwise I see non-inline calls in the codegen. Also something I want to clean up in a separate inliner change. Although, in this case we just need to increase the budget based on some heuristics like Vector.IsSupported branches. |
|
I meant specifically about adding the attribute on the interface's abstract definition itself. |
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| public static bool Compare(T left, T right) => left > right; |
There was a problem hiding this comment.
Are these not within the "always inline" threshold already?
I would've though this is like 5-6 bytes of IL and so something that we should trivially believe is in budget, particularly for primitive T
Maybe it being an op_GreaterThan constrained call is what's throwing things off?
There was a problem hiding this comment.
yeah this one was 14 bytes of IL, and we reduce "always inline" to 12 bytes once we ran out of time budget.
I removed this AggressiveInlining as it's no longer relevant with #127845 (comment)
There was a problem hiding this comment.
Interesting, I wonder if we should special case some of the IL sizes here like this one where it's namely big because of the constrained. <token> prefix.
We could of course special case all the generic math calls for the primitives as well, but it'd be nice if we didn't need to do that and have the inliner "do the right thing".
There was a problem hiding this comment.
Maybe in T1 we could do something like "inline everything <= 12 bytes" and then look at IL to make a heuristic for everything <= otherBound", such as if it is a single call or other special cases that are "likely profitable" but slightly larger.
4d1b2d4 to
a8ebe80
Compare
|
@MihaZupan @tannergooding so I decided to take another approach: if we already happen to inline a method with |
|
@EgorBot -arm -amd using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
private int[] _ints = default!;
private long[] _longs = default!;
private byte[] _bytes = default!;
[Params(2, 4, 16, 32, 200, 1000)]
public int Length { get; set; }
[GlobalSetup]
public void Setup()
{
var rng = new Random(42);
_ints = new int[Length];
_longs = new long[Length];
_bytes = new byte[Length];
for (int i = 0; i < Length; i++)
{
_ints[i] = rng.Next();
_longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
_bytes[i] = (byte)rng.Next(256);
}
}
[Benchmark] public int IntMax() => _ints.Max();
[Benchmark] public int IntMin() => _ints.Min();
[Benchmark] public long LongMax() => _longs.Max();
[Benchmark] public byte ByteMax() => _bytes.Max();
} |
|
PTAL @AndyAyersMS @dotnet/jit-contrib inliner changes please. I need this in other places as well, my work to replace unsafe code with safe too often runs into "out of time budget". Functions with SIMD are typically very rare outside of BCL so I don't expect any impact on users codebases. |
|
benchmarks seem happy without all the MethodImpl hacks EgorBot/Benchmarks#187 |
| case NI_IsSupported_False: | ||
| { | ||
| assert(sig->numArgs == 0); | ||
| impInlineRoot()->m_inlineStrategy->NoteHardwareIntrinsicCheckObserved(); |
There was a problem hiding this comment.
Do we want to make this observation for the false case, since presumably everything under that path is "dead code"?
There was a problem hiding this comment.
I assume this has been answered below
| CORINFO_CLASS_HANDLE typeArgHnd; | ||
| CorInfoType simdBaseJitType; | ||
|
|
||
| impInlineRoot()->m_inlineStrategy->NoteHardwareIntrinsicCheckObserved(); |
There was a problem hiding this comment.
Similar question here, do we want to make this observation in the case the only path is for an unsupported type? Will this negatively impact for example when T is System.__Canon and so all the intrinsics paths are dead/throwing?
There was a problem hiding this comment.
Sure we can (probably will cost us a JIT-EE call) but does it happen often? Also, the general assumption here is "method contains SIMD.IsSupported -> it's likely something we should allow ourselves to inline a bit more
There was a problem hiding this comment.
It occurs for most APIs involving Span vectorization, so all the cases of params ReadOnlySpan<object> args where you do args.SomeMethod()
It shouldn't require an additional JIT-EE call, as we already return constant true/false based on the switch just below.
I would think that we don't want the profitability boost for "anything", rather only for the cases that there could be live code. I would expect that we'd rather eliminate the definitively dead path or mark the definitively throwing path as cold early, and so have them excluded from the budgeting considerations (since we shouldn't be inlining anything in those blocks regardless and should rather pretend they don't exist).
There was a problem hiding this comment.
In other words, why boost the profitability by 5x for something that can never do anything but be a DCE or always throwing block?
There was a problem hiding this comment.
Ok, removed from NI_IsSupported_False, kept for NI_IsSupported_Dynamic
There was a problem hiding this comment.
I still think it's fine for us to be inprecise and IsSupported being false still may imply the current method is quite perf sensitive, even for unsupported T
| MAX_OVER_BUDGET_INTRINSIC_INLINES = 50 | ||
| MAX_OVER_BUDGET_INTRINSIC_INLINES = 50, | ||
|
|
||
| // When the root method or an already-imported inlinee references a |
There was a problem hiding this comment.
We still don't have a way to adjust how much budget was estimated to be used vs actually got used in the case most imported code is dead, right?
There was a problem hiding this comment.
We do make this adjustement:
runtime/src/coreclr/jit/flowgraph.cpp
Lines 570 to 583 in 4ffb643
This ultimately feeds into the budget update.
There was a problem hiding this comment.
Right, in this case it won't help us because there is nothing to delete - all paths can be used (at least on modern x64): 512, 256, 128, fallback. It's arm64 where we benefit from removing Vector256 and Vector512 dead weight.
So I needed more like a hint: this method contains a lot of IL, but the developer is more likely to assume all small helpers are inlined.
There was a problem hiding this comment.
all paths can be used (at least on modern x64): 512, 256, 128, fallback
Ah, I see. That is rather a problem with the current vectorized code structure in corelib and something I'd like to fix long term.
Ideally you aren't going if (V512 && length > V512.Count) { } if (V256 && length > V256.Count) { } if (V128 && length > V128.Count) { } scalar. This adds a lot of branches to hit the small case (likely common) and makes handling trailing elements really pessimized. -- Some of our code even duplicates the scalar loop or has a goto path at the beginning to handle scalar being common, which is also a bit blegh
The way TensorPrimitives has it setup is a bit nicer, as it ensures only one live path has to be considered and makes all the codegen overall smaller too, since we don't duplicate all that IL for cases that will never be hit.
Note
This PR is AI-generated.
This PR does several things:
MinMaxInteger(3) is needed because we run out of inliner budget in the safe version: https://gist.github.com/EgorBo/278e4bc6795cc829b9e129d7c5932f14
Benchmark
Summary — speedup of PR vs
main(>1 = PR faster, <1 = PR slower). Full results: EgorBot/Benchmarks#187.x64 (AMD EPYC, AVX-512) — significant wins from inliner-budget fixes + more efficient Vector256/512 → scalar reduction:
arm64 (Apple M2, Vector128) — mostly flat; minor regressions at small/medium lengths