Skip to content

Remove unsafe code from System.Linq.MaxMin#127845

Open
EgorBo wants to merge 2 commits intodotnet:mainfrom
EgorBo:reduce-unsafe-maxmin
Open

Remove unsafe code from System.Linq.MaxMin#127845
EgorBo wants to merge 2 commits intodotnet:mainfrom
EgorBo:reduce-unsafe-maxmin

Conversation

@EgorBo
Copy link
Copy Markdown
Member

@EgorBo EgorBo commented May 6, 2026

Note

This PR is AI-generated.

This PR does several things:

  1. Replace unsafe code with safe in MinMaxInteger
  2. Replace scalar loops over Vector512 and Vector256 with a reduction to Vector128 so we only run scalar loop over the smallest vector - this improves perf.
  3. Adds a bunch of [AI] to satisfy inliner

(3) is needed because we run out of inliner budget in the safe version: https://gist.github.com/EgorBo/278e4bc6795cc829b9e129d7c5932f14

Benchmark

public class Bench
{
    private int[] _ints = default!;
    private long[] _longs = default!;
    private byte[] _bytes = default!;

    [Params(2, 4, 16, 32, 200, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _ints = new int[Length];
        _longs = new long[Length];
        _bytes = new byte[Length];
        for (int i = 0; i < Length; i++)
        {
            _ints[i] = rng.Next();
            _longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
            _bytes[i] = (byte)rng.Next(256);
        }
    }

    [Benchmark] public int IntMax() => _ints.Max();
    [Benchmark] public int IntMin() => _ints.Min();
    [Benchmark] public long LongMax() => _longs.Max();
    [Benchmark] public byte ByteMax() => _bytes.Max();
}

Summary — speedup of PR vs main (>1 = PR faster, <1 = PR slower). Full results: EgorBot/Benchmarks#187.

x64 (AMD EPYC, AVX-512) — significant wins from inliner-budget fixes + more efficient Vector256/512 → scalar reduction:

Method 2 4 16 32 200 1000
IntMax 1.03 1.07 9.41 2.79 5.42 1.89
IntMin 1.02 1.07 3.10 2.61 3.03 1.22
LongMax 1.15 1.92 9.19 2.41 1.06 1.06
ByteMax 2.42 3.37 1.26 3.03 6.52 7.76

arm64 (Apple M2, Vector128) — mostly flat; minor regressions at small/medium lengths

Method 2 4 16 32 200 1000
IntMax 0.92 0.95 0.83 0.99 1.08 1.01
IntMin 0.96 0.99 0.83 0.98 1.05 1.02
LongMax 1.03 0.91 0.99 0.99 1.00 1.00
ByteMax 0.99 1.00 0.98 1.00 0.92 1.03

Copilot AI review requested due to automatic review settings May 6, 2026 00:04

This comment was marked as resolved.

@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

@MihuBot

@EgorBo

This comment was marked as resolved.

This comment was marked as resolved.

Copilot AI review requested due to automatic review settings May 6, 2026 12:11
@EgorBo EgorBo force-pushed the reduce-unsafe-maxmin branch from 15bf50d to 8837a91 Compare May 6, 2026 12:11
@EgorBo

This comment was marked as resolved.

This comment was marked as resolved.

@EgorBo EgorBo force-pushed the reduce-unsafe-maxmin branch from 4453202 to 9a4b7b1 Compare May 6, 2026 12:40
Copilot AI review requested due to automatic review settings May 6, 2026 12:46

This comment was marked as resolved.

Copilot AI review requested due to automatic review settings May 6, 2026 12:56

This comment was marked as resolved.

@EgorBo EgorBo force-pushed the reduce-unsafe-maxmin branch from 647f299 to 4eeec93 Compare May 6, 2026 14:45
@EgorBo

This comment was marked as resolved.

@EgorBo EgorBo marked this pull request as ready for review May 6, 2026 15:35
Copilot AI review requested due to automatic review settings May 6, 2026 15:35
@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

@EgorBot -arm -amd

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private int[] _ints = default!;
    private long[] _longs = default!;
    private byte[] _bytes = default!;

    [Params(2, 4, 16, 32, 200, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _ints = new int[Length];
        _longs = new long[Length];
        _bytes = new byte[Length];
        for (int i = 0; i < Length; i++)
        {
            _ints[i] = rng.Next();
            _longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
            _bytes[i] = (byte)rng.Next(256);
        }
    }

    [Benchmark] public int IntMax() => _ints.Max();
    [Benchmark] public int IntMin() => _ints.Min();
    [Benchmark] public long LongMax() => _longs.Max();
    [Benchmark] public byte ByteMax() => _bytes.Max();
}

@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

PTAL @MihaZupan @tannergooding

Had to mark the method NoInlining due to inliner issues I want to address separately. Currently it adds 0.5ns call overhead for small inputs (but only for trivial benchmark wrappers where the entire thing used to inline, in the real world it's very unlikely to happen) Otherwise, no extra bound checks in this impl in codegen.

benchmark: #127845 (comment) and results: EgorBot/Benchmarks#186
Significant wins on x64 due to inliner budget issues + more efficient "vector to scalar" path for Vector256 and Vector512.

Follow ups:

  • Improve inliner and remove NoInlining
  • Improve Vector128 to scalar for Max/Min via shuffle (wasn't the goal of this PR)

@EgorBo EgorBo requested review from MihaZupan and tannergooding May 6, 2026 16:42
Copy link
Copy Markdown
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub reviews are broken atm, so here's the comment I wanted to post:

Does this do anything since it's on the interface?

For the MethodImpl on private interface IMinMaxCalc<T>

@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

GitHub reviews are broken atm, so here's the comment I wanted to post:

Does this do anything since it's on the interface?

For the MethodImpl on private interface IMinMaxCalc<T>

In this method we currently still run out of time budget, [AI] gave inliner a hint that we should inline this 14b method even we when already ran out of time budget. Otherwise I see non-inline calls in the codegen.

Also something I want to clean up in a separate inliner change. Although, in this case we just need to increase the budget based on some heuristics like Vector.IsSupported branches.

@MihaZupan
Copy link
Copy Markdown
Member

I meant specifically about adding the attribute on the interface's abstract definition itself.
Does the JIT look at that at all? I had assumed it'd only care about the implementation types.

Comment on lines 19 to 20
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool Compare(T left, T right) => left > right;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these not within the "always inline" threshold already?

I would've though this is like 5-6 bytes of IL and so something that we should trivially believe is in budget, particularly for primitive T

Maybe it being an op_GreaterThan constrained call is what's throwing things off?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this one was 14 bytes of IL, and we reduce "always inline" to 12 bytes once we ran out of time budget.

I removed this AggressiveInlining as it's no longer relevant with #127845 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I wonder if we should special case some of the IL sizes here like this one where it's namely big because of the constrained. <token> prefix.

We could of course special case all the generic math calls for the primitives as well, but it'd be nice if we didn't need to do that and have the inliner "do the right thing".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in T1 we could do something like "inline everything <= 12 bytes" and then look at IL to make a heuristic for everything <= otherBound", such as if it is a single call or other special cases that are "likely profitable" but slightly larger.

@EgorBo EgorBo force-pushed the reduce-unsafe-maxmin branch from 4d1b2d4 to a8ebe80 Compare May 6, 2026 20:59
@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

@MihaZupan @tannergooding so I decided to take another approach: if we already happen to inline a method with Vector<ANY>.IsSupported\IsHardwareSupported (or had it in the root compilation) we increase our time budget by 5x. The rationale is that all functions with different SIMD paths (e.g. a big block for Vector128, then for 256, then 512, etc) are IL heavy and contain tons of small calls and we'd better inline them all (but still not with infinite budget, just in case). This should allow me to remove all the ugly workarounds. Although, this is also a bit ad-hoc heuristic.

@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

@EgorBot -arm -amd

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private int[] _ints = default!;
    private long[] _longs = default!;
    private byte[] _bytes = default!;

    [Params(2, 4, 16, 32, 200, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _ints = new int[Length];
        _longs = new long[Length];
        _bytes = new byte[Length];
        for (int i = 0; i < Length; i++)
        {
            _ints[i] = rng.Next();
            _longs[i] = ((long)rng.Next() << 32) | (uint)rng.Next();
            _bytes[i] = (byte)rng.Next(256);
        }
    }

    [Benchmark] public int IntMax() => _ints.Max();
    [Benchmark] public int IntMin() => _ints.Min();
    [Benchmark] public long LongMax() => _longs.Max();
    [Benchmark] public byte ByteMax() => _bytes.Max();
}

@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

PTAL @AndyAyersMS @dotnet/jit-contrib inliner changes please. I need this in other places as well, my work to replace unsafe code with safe too often runs into "out of time budget". Functions with SIMD are typically very rare outside of BCL so I don't expect any impact on users codebases.

@EgorBo EgorBo requested a review from AndyAyersMS May 6, 2026 21:08
@EgorBo
Copy link
Copy Markdown
Member Author

EgorBo commented May 6, 2026

benchmarks seem happy without all the MethodImpl hacks EgorBot/Benchmarks#187

Comment thread src/coreclr/jit/importercalls.cpp Outdated
case NI_IsSupported_False:
{
assert(sig->numArgs == 0);
impInlineRoot()->m_inlineStrategy->NoteHardwareIntrinsicCheckObserved();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make this observation for the false case, since presumably everything under that path is "dead code"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this has been answered below

CORINFO_CLASS_HANDLE typeArgHnd;
CorInfoType simdBaseJitType;

impInlineRoot()->m_inlineStrategy->NoteHardwareIntrinsicCheckObserved();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question here, do we want to make this observation in the case the only path is for an unsupported type? Will this negatively impact for example when T is System.__Canon and so all the intrinsics paths are dead/throwing?

Copy link
Copy Markdown
Member Author

@EgorBo EgorBo May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we can (probably will cost us a JIT-EE call) but does it happen often? Also, the general assumption here is "method contains SIMD.IsSupported -> it's likely something we should allow ourselves to inline a bit more

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs for most APIs involving Span vectorization, so all the cases of params ReadOnlySpan<object> args where you do args.SomeMethod()

It shouldn't require an additional JIT-EE call, as we already return constant true/false based on the switch just below.

I would think that we don't want the profitability boost for "anything", rather only for the cases that there could be live code. I would expect that we'd rather eliminate the definitively dead path or mark the definitively throwing path as cold early, and so have them excluded from the budgeting considerations (since we shouldn't be inlining anything in those blocks regardless and should rather pretend they don't exist).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, why boost the profitability by 5x for something that can never do anything but be a DCE or always throwing block?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removed from NI_IsSupported_False, kept for NI_IsSupported_Dynamic

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it's fine for us to be inprecise and IsSupported being false still may imply the current method is quite perf sensitive, even for unsupported T

Comment thread src/coreclr/jit/inline.h
MAX_OVER_BUDGET_INTRINSIC_INLINES = 50
MAX_OVER_BUDGET_INTRINSIC_INLINES = 50,

// When the root method or an already-imported inlinee references a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still don't have a way to adjust how much budget was estimated to be used vs actually got used in the case most imported code is dead, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do make this adjustement:

// Leave a note if we only did a partial import.
if (importedILSize != info.compILCodeSize)
{
JITDUMP("\n** Note: %s IL was partially imported -- imported %u of %u bytes of method IL\n",
compIsForInlining() ? "inlinee" : "root method", importedILSize, info.compILCodeSize);
}
// Record this for diagnostics and for the inliner's budget computations
info.compILImportSize = importedILSize;
if (compIsForInlining())
{
compInlineResult->SetImportedILSize(info.compILImportSize);
}

This ultimately feeds into the budget update.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in this case it won't help us because there is nothing to delete - all paths can be used (at least on modern x64): 512, 256, 128, fallback. It's arm64 where we benefit from removing Vector256 and Vector512 dead weight.

So I needed more like a hint: this method contains a lot of IL, but the developer is more likely to assume all small helpers are inlined.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all paths can be used (at least on modern x64): 512, 256, 128, fallback

Ah, I see. That is rather a problem with the current vectorized code structure in corelib and something I'd like to fix long term.

Ideally you aren't going if (V512 && length > V512.Count) { } if (V256 && length > V256.Count) { } if (V128 && length > V128.Count) { } scalar. This adds a lot of branches to hit the small case (likely common) and makes handling trailing elements really pessimized. -- Some of our code even duplicates the scalar loop or has a goto path at the beginning to handle scalar being common, which is also a bit blegh

The way TensorPrimitives has it setup is a bit nicer, as it ensures only one live path has to be considered and makes all the codegen overall smaller too, since we don't duplicate all that IL for cases that will never be hit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants