mlas/arm64: Add AArch64 assembly path for NCHWc float kernel and wire into build #27788
hariharans29 merged 6 commits into microsoft:main
Conversation
… into build
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Pull request overview
This PR adds an AArch64 NEON assembly implementation for the MLAS direct NCHWc float convolution kernel and wires it into the existing C++ entrypoint/build so the optimized path is selected transparently on supported ARM64 non-Windows builds.
Changes:
- Route `MlasConvNchwcFloatKernelNeon` to a new `MlasConvNchwcFloatKernelNeonAsm` implementation on ARM64 + NCHWc + non-Windows.
- Introduce a new assembly kernel (`SconvNchwcKernelNeon.S`) implementing left/middle/right regions and multi-output tiling (4/3/2/1).
- Wire the new assembly source into `setup_arm_neon_nchwc()` and expose the new symbol via `mlasi.h`.
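As a rough illustration of this routing (the stand-in functions and the string return values below are hypothetical; the real MLAS entrypoints take the full NCHWc convolution parameter list), the dispatch amounts to a compile-time forward on ARM64 non-Windows builds:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the two kernels. The real symbols
// (MlasConvNchwcFloatKernelNeon / MlasConvNchwcFloatKernelNeonAsm) take the
// NCHWc convolution parameter list, elided here for brevity.
static const char* ConvKernelCpp() { return "cpp"; }
static const char* ConvKernelAsm() { return "asm"; }

// Sketch of the dispatch: on ARM64 non-Windows builds the stable C++
// entrypoint forwards to the assembly implementation; elsewhere it keeps
// the portable C++ path, so existing call sites are unchanged.
const char* MlasConvNchwcDispatch() {
#if defined(__aarch64__) && !defined(_WIN32)
    return ConvKernelAsm();
#else
    return ConvKernelCpp();
#endif
}
```

Because the selection happens inside the existing entrypoint, callers never need to know which implementation they got.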
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Dispatches the stable NCHWc C++ entrypoint to the new asm kernel on supported builds. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares the new asm kernel symbol for non-Windows builds. |
| onnxruntime/core/mlas/lib/aarch64/SconvNchwcKernelNeon.S | Adds the AArch64 NEON assembly micro-kernel implementation for NCHWc float conv. |
| cmake/onnxruntime_mlas.cmake | Adds the new assembly file to the ARM NEON NCHWc build sources (non-Windows). |
Thanks - can you please resolve the Copilot comments?

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Done. The fixes are in 5e0dc2d.
Yes, "4-output center tile" here means 4 spatial output positions.

For filter count, this kernel still iterates filter blocks in the outer loop.

So unlike the NCHW kernel's 2-output/filter-block pairing, this path is not "2 filter counts at once"; it is 4 spatial outputs per step, with filter blocks processed one-by-one in the existing loop.
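A minimal sketch of that tiling (the helper below is hypothetical, not the actual kernel code): a row of output positions is consumed greedily in 4-wide center tiles, leaving at most one 3/2/1 remainder tile, while filter blocks stay in the enclosing loop:

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration of the 4/3/2/1 multi-output tiling described
// above: consume output positions in tiles of 4, then emit a single
// remainder tile of 3, 2 or 1 positions. Filter blocks would be iterated
// one at a time by an outer loop around a routine like this.
std::vector<int> TileOutputRow(int output_count) {
    std::vector<int> tiles;
    while (output_count >= 4) {
        tiles.push_back(4);             // 4-output center tile
        output_count -= 4;
    }
    if (output_count > 0) {
        tiles.push_back(output_count);  // remainder tile (3, 2 or 1)
    }
    return tiles;
}
```

For example, a row of 11 outputs would be processed as two 4-output tiles followed by one 3-output tile.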
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
- Fix SconvNchwcKernelNeon.S KernelFlags==0 remainder path temp spill slot (sp+#120 -> sp+#248) to avoid clobbering callee-saved SIMD spills (q8-q15)
- Replace \@-based local labels with numeric local labels (90f/91b style) for portable assembly parsing (including macOS toolchains)

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
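To see why moving the slot matters, here is a hypothetical model of the overlap (only the #120 and #248 offsets come from the commit; the base offset of the q8-q15 spill region below is an assumption for illustration, not taken from the actual kernel):

```cpp
#include <cassert>

// Hypothetical model of the stack-frame clobber the commit fixes. Assume
// the callee-saved q8-q15 spills occupy [sp+64, sp+192): 8 registers of
// 16 bytes each (the base offset 64 is an illustrative assumption).
constexpr int kCalleeSpillBase = 64;
constexpr int kCalleeSpillEnd  = kCalleeSpillBase + 8 * 16;  // sp+192

// True if a 16-byte temp spill at `offset` overlaps the callee-saved region.
constexpr bool TempClobbersCalleeSpills(int offset) {
    return offset < kCalleeSpillEnd && offset + 16 > kCalleeSpillBase;
}
```

Under this layout a temp at sp+#120 lands inside the callee-saved region (corrupting q8-q15 on return), while sp+#248 sits safely past it.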
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Yes, I was able to reproduce them. The fix is in bd07f63; it also fixes failures from labels that couldn't compile on macOS. Running the tests now, they pass:
Thanks. I take it that the clobbering fix didn't have any tangible perf loss, i.e. are the numbers pasted in the PR still true?
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Yes, no performance loss. The numbers are the same as pasted in the PR.

It seems like there are still some failures?
…kage Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Commenter does not have sufficient privileges for PR 27788 in repo microsoft/onnxruntime
Yes, it is a macOS failure in the code we used. I have tried to rerun the pipelines, but I do not have sufficient privileges to rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Thanks for the fixes. I'll re-run the pipelines. I've also asked Copilot for one more review. Provided all is good, I will merge it, thanks.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Thank you! I have addressed the Copilot reviews. I reran the benchmark after the changes, and this PR slightly improved each case by about 0.1x, but I do not think that is a significant change. Before I committed the change, I believe there were no failures in the pipeline.

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description
This change introduces a new AArch64 assembly implementation for the NCHWc float convolution path and keeps the existing C++ entrypoint stable, with build wiring so the same kernel entry is used transparently by existing call sites.
The optimization strategy is centered on three ideas:
Motivation and Context
This improves throughput while preserving existing behavior and integration points, and addresses comment #27099 (comment) from that PR (cc: @Rohanjames1997 and @hariharans29).
Per-convolution improvements, per core count, for convolutions in the model from #25580 (comment) when running on AWS Graviton 4 (all of these convolutions share N = 1, KH = 3, KW = 3, DH = 1, DW = 1 and G = 1):
And the end-to-end performance improvement when running the model at different core counts:
Running the `bench_sconv_nchwc.cpp` benchmark on an AWS Graviton 4 `c8g.16xlarge`, the following is obtained:

Running the same benchmark on the same system without this PR we get:
which shows the following speed up:
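For clarity on how the speed-up figures are derived (the helper and the numbers in its usage are hypothetical, for illustration only), each speed-up is simply the ratio of the baseline time to the time with this PR for the same case:

```cpp
#include <cassert>

// Hypothetical helper: the speed-up reported for each benchmark case is
// baseline_time / time_with_pr, so values above 1.0 mean the PR is faster.
constexpr double SpeedUp(double baseline_time, double time_with_pr) {
    return baseline_time / time_with_pr;
}
```

For example, a case that took 200 units before the PR and 100 units after would be reported as a 2.0x speed-up.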