Remove unsafe code from System.Linq.MaxMin by EgorBo · Pull Request #127845 · dotnet/runtime

EgorBo · 2026-05-06T00:04:12Z

Note

This PR is AI-generated.

This PR does several things:

Replace unsafe code with safe in MinMaxInteger
Replace scalar loops over Vector512 and Vector256 with a reduction to Vector128 so we only run scalar loop over the smallest vector - this improves perf.
Adds a bunch of [AI] to satisfy inliner

(3) is needed because we run out of inliner budget in the safe version: https://gist.github.com/EgorBo/278e4bc6795cc829b9e129d7c5932f14

Benchmark

public class Bench
{
    private int[] _ints = default!;
    private long[] _longs = default!;
    private byte[] _bytes = default!;

    [Params(2, 4, 16, 32, 200, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _ints = new int[Length];
        _longs = new long[Length];
        _bytes = new byte[Length];
        for (int i = 0; i < Length; i++)
        {
            _ints[i] = rng.Next();
            _longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
            _bytes[i] = (byte)rng.Next(256);
        }
    }

    [Benchmark] public int IntMax() => _ints.Max();
    [Benchmark] public int IntMin() => _ints.Min();
    [Benchmark] public long LongMax() => _longs.Max();
    [Benchmark] public byte ByteMax() => _bytes.Max();
}

Summary — speedup of PR vs main (>1 = PR faster, <1 = PR slower). Full results: EgorBot/Benchmarks#187.

x64 (AMD EPYC, AVX-512) — significant wins from inliner-budget fixes + more efficient Vector256/512 → scalar reduction:

Method	2	4	16	32	200	1000
IntMax	1.03	1.07	9.41	2.79	5.42	1.89
IntMin	1.02	1.07	3.10	2.61	3.03	1.22
LongMax	1.15	1.92	9.19	2.41	1.06	1.06
ByteMax	2.42	3.37	1.26	3.03	6.52	7.76

arm64 (Apple M2, Vector128) — mostly flat; minor regressions at small/medium lengths

Method	2	4	16	32	200	1000
IntMax	0.92	0.95	0.83	0.99	1.08	1.01
IntMin	0.96	0.99	0.83	0.98	1.05	1.02
LongMax	1.03	0.91	0.99	0.99	1.00	1.00
ByteMax	0.99	1.00	0.98	1.00	0.92	1.03

EgorBo · 2026-05-06T00:12:28Z

@MihuBot

EgorBo · 2026-05-06T15:54:02Z

@EgorBot -arm -amd

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private int[] _ints = default!;
    private long[] _longs = default!;
    private byte[] _bytes = default!;

    [Params(2, 4, 16, 32, 200, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _ints = new int[Length];
        _longs = new long[Length];
        _bytes = new byte[Length];
        for (int i = 0; i < Length; i++)
        {
            _ints[i] = rng.Next();
            _longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
            _bytes[i] = (byte)rng.Next(256);
        }
    }

    [Benchmark] public int IntMax() => _ints.Max();
    [Benchmark] public int IntMin() => _ints.Min();
    [Benchmark] public long LongMax() => _longs.Max();
    [Benchmark] public byte ByteMax() => _bytes.Max();
}

EgorBo · 2026-05-06T16:40:46Z

PTAL @MihaZupan @tannergooding

Had to mark the method NoInlining due to inliner issues I want to address separately. Currently it adds 0.5ns call overhead for small inputs (but only for trivial benchmark wrappers where the entire thing used to inline, in the real world it's very unlikely to happen) Otherwise, no extra bound checks in this impl in codegen.

benchmark: #127845 (comment) and results: EgorBot/Benchmarks#186
Significant wins on x64 due to inliner budget issues + more efficient "vector to scalar" path for Vector256 and Vector512.

Follow ups:

Improve inliner and remove NoInlining
Improve Vector128 to scalar for Max/Min via shuffle (wasn't the goal of this PR)

MihaZupan

GitHub reviews are broken atm, so here's the comment I wanted to post:

Does this do anything since it's on the interface?

For the MethodImpl on private interface IMinMaxCalc<T>

EgorBo · 2026-05-06T18:42:57Z

GitHub reviews are broken atm, so here's the comment I wanted to post:

Does this do anything since it's on the interface?

For the MethodImpl on private interface IMinMaxCalc<T>

In this method we currently still run out of time budget, [AI] gave inliner a hint that we should inline this 14b method even we when already ran out of time budget. Otherwise I see non-inline calls in the codegen.

Also something I want to clean up in a separate inliner change. Although, in this case we just need to increase the budget based on some heuristics like Vector.IsSupported branches.

MihaZupan · 2026-05-06T18:47:51Z

I meant specifically about adding the attribute on the interface's abstract definition itself.
Does the JIT look at that at all? I had assumed it'd only care about the implementation types.

tannergooding · 2026-05-06T19:39:55Z

+            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public static bool Compare(T left, T right) => left > right;


Are these not within the "always inline" threshold already?

I would've though this is like 5-6 bytes of IL and so something that we should trivially believe is in budget, particularly for primitive T

Maybe it being an op_GreaterThan constrained call is what's throwing things off?

yeah this one was 14 bytes of IL, and we reduce "always inline" to 12 bytes once we ran out of time budget.

I removed this AggressiveInlining as it's no longer relevant with #127845 (comment)

Interesting, I wonder if we should special case some of the IL sizes here like this one where it's namely big because of the constrained. <token> prefix.

We could of course special case all the generic math calls for the primitives as well, but it'd be nice if we didn't need to do that and have the inliner "do the right thing".

Maybe in T1 we could do something like "inline everything <= 12 bytes" and then look at IL to make a heuristic for everything <= otherBound", such as if it is a single call or other special cases that are "likely profitable" but slightly larger.

EgorBo · 2026-05-06T21:02:11Z

@MihaZupan @tannergooding so I decided to take another approach: if we already happen to inline a method with Vector<ANY>.IsSupported\IsHardwareSupported (or had it in the root compilation) we increase our time budget by 5x. The rationale is that all functions with different SIMD paths (e.g. a big block for Vector128, then for 256, then 512, etc) are IL heavy and contain tons of small calls and we'd better inline them all (but still not with infinite budget, just in case). This should allow me to remove all the ugly workarounds. Although, this is also a bit ad-hoc heuristic.

EgorBo · 2026-05-06T21:02:21Z

@EgorBot -arm -amd

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private int[] _ints = default!;
    private long[] _longs = default!;
    private byte[] _bytes = default!;

    [Params(2, 4, 16, 32, 200, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _ints = new int[Length];
        _longs = new long[Length];
        _bytes = new byte[Length];
        for (int i = 0; i < Length; i++)
        {
            _ints[i] = rng.Next();
            _longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
            _bytes[i] = (byte)rng.Next(256);
        }
    }

    [Benchmark] public int IntMax() => _ints.Max();
    [Benchmark] public int IntMin() => _ints.Min();
    [Benchmark] public long LongMax() => _longs.Max();
    [Benchmark] public byte ByteMax() => _bytes.Max();
}

EgorBo · 2026-05-06T21:07:53Z

PTAL @AndyAyersMS @dotnet/jit-contrib inliner changes please. I need this in other places as well, my work to replace unsafe code with safe too often runs into "out of time budget". Functions with SIMD are typically very rare outside of BCL so I don't expect any impact on users codebases.

EgorBo · 2026-05-06T21:49:35Z

benchmarks seem happy without all the MethodImpl hacks EgorBot/Benchmarks#187

tannergooding · 2026-05-06T22:00:56Z

                case NI_IsSupported_False:
                {
                    assert(sig->numArgs == 0);
+                    impInlineRoot()->m_inlineStrategy->NoteHardwareIntrinsicCheckObserved();


Do we want to make this observation for the false case, since presumably everything under that path is "dead code"?

I assume this has been answered below

tannergooding · 2026-05-06T22:02:32Z

                    CORINFO_CLASS_HANDLE typeArgHnd;
                    CorInfoType          simdBaseJitType;

+                    impInlineRoot()->m_inlineStrategy->NoteHardwareIntrinsicCheckObserved();


Similar question here, do we want to make this observation in the case the only path is for an unsupported type? Will this negatively impact for example when T is System.__Canon and so all the intrinsics paths are dead/throwing?

Sure we can (probably will cost us a JIT-EE call) but does it happen often? Also, the general assumption here is "method contains SIMD.IsSupported -> it's likely something we should allow ourselves to inline a bit more

It occurs for most APIs involving Span vectorization, so all the cases of params ReadOnlySpan<object> args where you do args.SomeMethod()

It shouldn't require an additional JIT-EE call, as we already return constant true/false based on the switch just below.

I would think that we don't want the profitability boost for "anything", rather only for the cases that there could be live code. I would expect that we'd rather eliminate the definitively dead path or mark the definitively throwing path as cold early, and so have them excluded from the budgeting considerations (since we shouldn't be inlining anything in those blocks regardless and should rather pretend they don't exist).

In other words, why boost the profitability by 5x for something that can never do anything but be a DCE or always throwing block?

Ok, removed from NI_IsSupported_False, kept for NI_IsSupported_Dynamic

I still think it's fine for us to be inprecise and IsSupported being false still may imply the current method is quite perf sensitive, even for unsupported T

tannergooding · 2026-05-06T22:06:45Z

-        MAX_OVER_BUDGET_INTRINSIC_INLINES = 50
+        MAX_OVER_BUDGET_INTRINSIC_INLINES = 50,
+
+        // When the root method or an already-imported inlinee references a


We still don't have a way to adjust how much budget was estimated to be used vs actually got used in the case most imported code is dead, right?

We do make this adjustement:

runtime/src/coreclr/jit/flowgraph.cpp

Lines 570 to 583 in 4ffb643

// Leave a note if we only did a partial import.

if (importedILSize != info.compILCodeSize)

{

JITDUMP("\n** Note: %s IL was partially imported -- imported %u of %u bytes of method IL\n",

compIsForInlining() ? "inlinee" : "root method", importedILSize, info.compILCodeSize);

}

// Record this for diagnostics and for the inliner's budget computations

info.compILImportSize = importedILSize;

if (compIsForInlining())

{

compInlineResult->SetImportedILSize(info.compILImportSize);

}

This ultimately feeds into the budget update.

Right, in this case it won't help us because there is nothing to delete - all paths can be used (at least on modern x64): 512, 256, 128, fallback. It's arm64 where we benefit from removing Vector256 and Vector512 dead weight.

So I needed more like a hint: this method contains a lot of IL, but the developer is more likely to assume all small helpers are inlined.

all paths can be used (at least on modern x64): 512, 256, 128, fallback

Ah, I see. That is rather a problem with the current vectorized code structure in corelib and something I'd like to fix long term.

Ideally you aren't going if (V512 && length > V512.Count) { } if (V256 && length > V256.Count) { } if (V128 && length > V128.Count) { } scalar. This adds a lot of branches to hit the small case (likely common) and makes handling trailing elements really pessimized. -- Some of our code even duplicates the scalar loop or has a goto path at the beginning to handle scalar being common, which is also a bit blegh

The way TensorPrimitives has it setup is a bit nicer, as it ensures only one live path has to be considered and makes all the codegen overall smaller too, since we don't duplicate all that IL for cases that will never be hit.

Copilot AI review requested due to automatic review settings May 6, 2026 00:04

EgorBo added the reduce-unsafe label May 6, 2026

Copilot started reviewing on behalf of EgorBo May 6, 2026 00:04 View session

dotnet-policy-service Bot assigned EgorBo May 6, 2026

github-actions Bot added the area-System.Linq label May 6, 2026