mlas/arm64: Add AArch64 assembly path for NCHWc float kernel and wire into build #27788

Merged
hariharans29 merged 6 commits into microsoft:main from milpuz01:mlas_nchwc_conv
Mar 25, 2026
Conversation

@milpuz01
Contributor

@milpuz01 milpuz01 commented Mar 20, 2026

Description

This change introduces a new AArch64 assembly implementation for the NCHWc float convolution path while keeping the existing C++ entrypoint stable, with build wiring so the same kernel entry is used transparently by existing call sites.

The optimization strategy is centered on three ideas:

  1. Route the stable NCHWc C++ entrypoint to an AArch64 assembly implementation.
  2. Split execution into left/middle/right regions, with a 4-output center tile and 3/2/1 remainder kernels in the middle region.
  3. Duplicate hot loops for flags == 0 so the no-post-op path avoids repeated accumulate/bias/ReLU flag checks inside the hottest loops.
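The region split in ideas 2 and 3 can be sketched as follows (hypothetical helper names, not the actual kernel code; edge extents in the real kernel depend on kernel size, stride, and dilation, simplified here to one padded column per side):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: outputs whose input window touches the left/right
// padding take the checked edge path; the interior ("middle") region is
// covered by 4-output tiles plus a single 3/2/1 remainder tile.
struct Regions {
    int left;    // outputs needing left-edge bounds checks
    int middle;  // interior outputs, no bounds checks
    int right;   // outputs needing right-edge bounds checks
};

Regions SplitOutputRow(int output_width, int pad_left, int pad_right) {
    return {pad_left, output_width - pad_left - pad_right, pad_right};
}

// Cover the middle region with 4-output tiles, then one remainder tile.
std::vector<int> TileMiddle(int middle) {
    std::vector<int> tiles(middle / 4, 4);
    if (middle % 4 != 0) tiles.push_back(middle % 4);  // 3, 2 or 1 remainder
    return tiles;
}
```

For the OW = 192, pad 1/1 cases in the table below this yields 1 + 190 + 1 outputs, with the middle covered by 47 four-output tiles and one 2-output remainder.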

Motivation and Context

This improves throughput while preserving existing behavior and integration points, and addresses comment #27099 (comment) from that PR (cc: @Rohanjames1997 and @hariharans29).

Per-convolution improvements for the convolutions in the model from #25580 (comment), per core count, when running on AWS Graviton 4 (all of these convolutions share N = 1, KH = 3, KW = 3, DH = 1, DW = 1 and G = 1):

| Cores | IC | OC | IH | IW | OH | OW | SH | SW | PT | PL | PB | PR | P90 Improvement | P99 Improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 32 | 32 | 192 | 192 | 192 | 192 | 1 | 1 | 1 | 1 | 1 | 1 | +47.57% | +43.64% |
| 1 | 32 | 96 | 192 | 192 | 96 | 96 | 2 | 2 | 0 | 0 | 1 | 1 | +47.56% | +44.69% |
| 1 | 48 | 192 | 96 | 96 | 48 | 48 | 2 | 2 | 0 | 0 | 1 | 1 | +47.25% | +45.12% |
| 1 | 48 | 192 | 96 | 96 | 96 | 96 | 1 | 1 | 1 | 1 | 1 | 1 | +48.73% | +47.07% |
| 1 | 64 | 256 | 48 | 48 | 48 | 48 | 1 | 1 | 1 | 1 | 1 | 1 | +47.73% | +47.76% |
| 2 | 32 | 32 | 192 | 192 | 192 | 192 | 1 | 1 | 1 | 1 | 1 | 1 | +47.72% | +42.78% |
| 2 | 32 | 96 | 192 | 192 | 96 | 96 | 2 | 2 | 0 | 0 | 1 | 1 | +48.41% | +47.76% |
| 2 | 48 | 192 | 96 | 96 | 48 | 48 | 2 | 2 | 0 | 0 | 1 | 1 | +47.13% | +44.62% |
| 2 | 48 | 192 | 96 | 96 | 96 | 96 | 1 | 1 | 1 | 1 | 1 | 1 | +48.58% | +46.65% |
| 2 | 64 | 256 | 48 | 48 | 48 | 48 | 1 | 1 | 1 | 1 | 1 | 1 | +47.59% | +47.32% |
| 4 | 32 | 32 | 192 | 192 | 192 | 192 | 1 | 1 | 1 | 1 | 1 | 1 | +45.18% | +38.51% |
| 4 | 32 | 96 | 192 | 192 | 96 | 96 | 2 | 2 | 0 | 0 | 1 | 1 | +46.34% | +45.25% |
| 4 | 48 | 192 | 96 | 96 | 48 | 48 | 2 | 2 | 0 | 0 | 1 | 1 | +46.17% | +42.45% |
| 4 | 48 | 192 | 96 | 96 | 96 | 96 | 1 | 1 | 1 | 1 | 1 | 1 | +48.20% | +45.25% |
| 4 | 64 | 256 | 48 | 48 | 48 | 48 | 1 | 1 | 1 | 1 | 1 | 1 | +47.32% | +46.67% |
| 8 | 32 | 32 | 192 | 192 | 192 | 192 | 1 | 1 | 1 | 1 | 1 | 1 | +40.07% | +25.57% |
| 8 | 32 | 96 | 192 | 192 | 96 | 96 | 2 | 2 | 0 | 0 | 1 | 1 | +42.93% | +45.06% |
| 8 | 48 | 192 | 96 | 96 | 48 | 48 | 2 | 2 | 0 | 0 | 1 | 1 | +44.50% | +38.86% |
| 8 | 48 | 192 | 96 | 96 | 96 | 96 | 1 | 1 | 1 | 1 | 1 | 1 | +47.31% | +45.22% |
| 8 | 64 | 256 | 48 | 48 | 48 | 48 | 1 | 1 | 1 | 1 | 1 | 1 | +47.02% | +46.42% |
| 16 | 32 | 32 | 192 | 192 | 192 | 192 | 1 | 1 | 1 | 1 | 1 | 1 | +34.06% | +28.48% |
| 16 | 32 | 96 | 192 | 192 | 96 | 96 | 2 | 2 | 0 | 0 | 1 | 1 | +38.12% | +39.83% |
| 16 | 48 | 192 | 96 | 96 | 48 | 48 | 2 | 2 | 0 | 0 | 1 | 1 | +42.17% | +35.48% |
| 16 | 48 | 192 | 96 | 96 | 96 | 96 | 1 | 1 | 1 | 1 | 1 | 1 | +45.25% | +42.73% |
| 16 | 64 | 256 | 48 | 48 | 48 | 48 | 1 | 1 | 1 | 1 | 1 | 1 | +46.44% | +45.84% |

And end-to-end performance improvement when running the model at different core counts:

| Model | Cores | P90 Improvement | P99 Improvement |
|---|---|---|---|
| shareable_model.onnx | 1 | +24.35% | +24.17% |
| shareable_model.onnx | 2 | +24.14% | +23.82% |
| shareable_model.onnx | 4 | +23.40% | +23.22% |
| shareable_model.onnx | 8 | +22.72% | +22.55% |
| shareable_model.onnx | 16 | +20.19% | +19.89% |

Running the bench_sconv_nchwc.cpp benchmark on an AWS Graviton 4 c8g.16xlarge instance, the following is obtained:

$ ./build/Linux/Release/onnxruntime_mlas_benchmark --benchmark_filter='SCONV_NCHWC_DIRECT/DirectNchwcCases'
The number of inputs is very large. QNBITGEMM<float, 4> will be repeated at least 128 times.
The number of inputs is very large. QNBITGEMM<float, 8> will be repeated at least 128 times.
2026-03-23T17:35:58+00:00
Running ./build/Linux/Release/onnxruntime_mlas_benchmark
Run on (64 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x64)
  L3 Unified 36864 KiB (x1)
Load Average: 0.01, 0.13, 0.36
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                              Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:32/IH:192/IW:192/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time    9169970 ns      9170100 ns           76
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:96/IH:192/IW:192/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time    7048829 ns      7047903 ns           99
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time     5422228 ns      5421912 ns          129
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time    22050813 ns     22047140 ns           31
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:64/OC:256/IH:48/IW:48/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time     9856375 ns      9855928 ns           71

Running the same benchmark on the same system without this PR, we get:

Running ./build/Linux/Release/onnxruntime_mlas_benchmark
Run on (64 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x64)
  L3 Unified 36864 KiB (x1)
Load Average: 24.34, 13.45, 5.35
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                              Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:32/IH:192/IW:192/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time   17576371 ns     17576571 ns           40
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:96/IH:192/IW:192/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time   14319665 ns     14316035 ns           49
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time    11710396 ns     11707113 ns           60
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time    47237327 ns     47217289 ns           15
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:64/OC:256/IH:48/IW:48/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time    21110794 ns     21103617 ns           33

which shows the following speedup:

| Benchmark case | With this PR (ns) | Without this PR (ns) | Speedup |
|---|---|---|---|
| IC32 OC32 IH192 IW192 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 9,169,970 | 17,576,371 | 1.92x |
| IC32 OC96 IH192 IW192 KH3 KW3 PT0 PL0 PB1 PR1 S2 D1 | 7,048,829 | 14,319,665 | 2.03x |
| IC48 OC192 IH96 IW96 KH3 KW3 PT0 PL0 PB1 PR1 S2 D1 | 5,422,228 | 11,710,396 | 2.16x |
| IC48 OC192 IH96 IW96 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 22,050,813 | 47,237,327 | 2.14x |
| IC64 OC256 IH48 IW48 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 9,856,375 | 21,110,794 | 2.14x |
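The speedup column is simply the baseline time divided by the optimized time; a tiny helper (hypothetical name) reproducing the rounding used in the table:

```cpp
#include <cassert>
#include <cmath>

// Speedup = baseline_ns / optimized_ns, rounded to two decimals,
// e.g. 17,576,371 ns / 9,169,970 ns -> 1.92x for the first row.
double Speedup(double baseline_ns, double optimized_ns) {
    return std::round(baseline_ns / optimized_ns * 100.0) / 100.0;
}
```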

… into build

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR adds an AArch64 NEON assembly implementation for the MLAS direct NCHWc float convolution kernel and wires it into the existing C++ entrypoint/build so the optimized path is selected transparently on supported ARM64 non-Windows builds.

Changes:

  • Route MlasConvNchwcFloatKernelNeon to a new MlasConvNchwcFloatKernelNeonAsm implementation on ARM64 + NCHWc + non-Windows.
  • Introduce a new assembly kernel (SconvNchwcKernelNeon.S) implementing left/middle/right regions and multi-output tiling (4/3/2/1).
  • Wire the new assembly source into setup_arm_neon_nchwc() and expose the new symbol via mlasi.h.
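As a sketch of that routing (hypothetical names and a drastically simplified signature; the real MLAS entrypoint takes input/filter/output pointers, strides, counts, bias, and flags, and the assembly implementation is a real `.S` symbol rather than a C++ stub), the dispatch pattern looks roughly like:

```cpp
#include <cassert>

// Hypothetical, self-contained sketch of the dispatch pattern: the stable
// entrypoint routes to the assembly path on ARM64 non-Windows builds and
// otherwise keeps the existing C++ path. The "asm" function is stubbed in
// C++ here purely so the sketch compiles standalone.
namespace sketch {

void ConvKernelAsm(const float* in, float* out) { *out = *in * 2.0f; }  // stand-in
void ConvKernelCpp(const float* in, float* out) { *out = *in * 2.0f; }  // fallback

void ConvKernel(const float* in, float* out) {
#if defined(__aarch64__) && !defined(_WIN32)
    ConvKernelAsm(in, out);  // supported builds: assembly path, transparently
#else
    ConvKernelCpp(in, out);  // everywhere else: existing C++ implementation
#endif
}

}  // namespace sketch
```

Because both branches implement the same contract, existing call sites are unaffected by which path is taken.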

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Dispatches the stable NCHWc C++ entrypoint to the new asm kernel on supported builds. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares the new asm kernel symbol for non-Windows builds. |
| onnxruntime/core/mlas/lib/aarch64/SconvNchwcKernelNeon.S | Adds the AArch64 NEON assembly micro-kernel implementation for NCHWc float conv. |
| cmake/onnxruntime_mlas.cmake | Adds the new assembly file to the ARM NEON NCHWc build sources (non-Windows). |


Comment thread onnxruntime/core/mlas/lib/aarch64/SconvNchwcKernelNeon.S Outdated
Comment thread onnxruntime/core/mlas/lib/aarch64/SconvNchwcKernelNeon.S Outdated
@hariharans29
Member

Thanks - can you please resolve Copilot comments ?

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

hariharans29 commented Mar 20, 2026

a 4-output center tile - Does this mean 4 output pixels at a time in the interior path ? How many filter counts are processed in that pass ? Is it 2 like the NCHW kernel (or 1 filter only due to register budget constraints) ?

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Comment thread onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
@milpuz01
Contributor Author

Thanks - can you please resolve Copilot comments ?

Done. The fixes are in: 5e0dc2d

@milpuz01
Contributor Author

a 4-output center tile - Does this mean 4 output pixels at a time in the interior path ? How many filter counts are processed in that pass ? Is it 2 like the NCHW kernel (or 1 filter only due to register budget constraints) ?

Yes, “4-output center tile” here means 4 spatial output positions (ow..ow+3) computed together in the interior (middle) path where bounds checks are not needed.

For filter count, this kernel still iterates filter blocks in the outer loop (x3 = FilterCount). Each iteration handles one NCHWc filter block (16 output channels), so the 4-output tiling is spatial only.

So unlike the NCHW kernel’s 2-output/filter-block pairing, this path is not “2 filter counts at once”; it is 4 spatial outputs per step, with filter blocks processed one-by-one in the existing loop.
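The loop nesting described above can be sketched as a minimal self-contained example (hypothetical names; the real kernel accumulates over KH/KW taps and input channel blocks, elided here to a single 1x1 tap on one input channel):

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical sketch: filter blocks are walked one at a time in the outer
// loop; each interior step produces 4 spatial outputs (ow..ow+3) for the
// 16 output channels of that NCHWc filter block, so the 4-wide tiling is
// spatial only.
constexpr int kBlock = 16;  // NCHWc channel-block size on this path

void InteriorTile(const std::vector<float>& input,    // [width], one channel
                  const std::vector<float>& weights,  // [filter_blocks * kBlock]
                  std::vector<float>& output,         // [filter_blocks * width * kBlock]
                  int width, int filter_blocks) {
    for (int fb = 0; fb < filter_blocks; ++fb) {      // one filter block per pass
        for (int ow = 0; ow + 4 <= width; ow += 4) {  // 4-output spatial tile
            std::array<float, 4 * kBlock> acc{};      // 4 outputs x 16 channels
            for (int t = 0; t < 4; ++t)
                for (int c = 0; c < kBlock; ++c)
                    acc[t * kBlock + c] = input[ow + t] * weights[fb * kBlock + c];
            for (int t = 0; t < 4; ++t)
                for (int c = 0; c < kBlock; ++c)
                    output[(fb * width + ow + t) * kBlock + c] = acc[t * kBlock + c];
        }
    }
}
```

The 64 accumulators (4 outputs x 16 channels) map naturally onto the AArch64 SIMD register file as 16 q-registers of 4 floats each, which is why the spatial tile stops at 4.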

@milpuz01 milpuz01 requested a review from hariharans29 March 23, 2026 12:46
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

  - Fix SconvNchwcKernelNeon.S KernelFlags==0 remainder path temp spill slot (sp+#120 -> sp+#248) to avoid clobbering callee-saved SIMD spills (q8-q15)
  - Replace \@-based local labels with numeric local labels (90f/91b style) for portable assembly parsing (including macOS toolchains)

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
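The label portability point in the second bullet can be illustrated with a small fragment (illustrative only, not taken from the kernel): `\@`-suffixed macro-local labels are a GNU assembler extension that some toolchains, notably Apple's, reject, whereas numeric local labels with `f` (forward) and `b` (backward) references assemble everywhere:

```asm
// Numeric local labels: "90:" defines, "90f" references the next
// definition forward, "90b" the most recent one backward.
    cbz     x4, 90f          // skip the loop if the count is zero
89:
    subs    x4, x4, #1
    b.ne    89b              // backward branch to label 89
90:                          // forward target
```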
@hariharans29
Member

(screenshot: CI test failures)

it seems like there are some test failures with this change

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@milpuz01
Contributor Author

it seems like there are some test failures with this change

Yes, I was able to reproduce them. The fix is in bd07f63; it also fixes label usage that couldn't compile on macOS. Running the tests now, they pass:

$ ./build/Linux/Release/onnxruntime_test_all --gtest_filter='NchwcOptimizerTests.*'
Note: Google Test filter = NchwcOptimizerTests.*
[==========] Running 31 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 31 tests from NchwcOptimizerTests
[ RUN      ] NchwcOptimizerTests.ConvNchw
[       OK ] NchwcOptimizerTests.ConvNchw (138 ms)
[ RUN      ] NchwcOptimizerTests.ConvNchwc
[       OK ] NchwcOptimizerTests.ConvNchwc (59 ms)
[ RUN      ] NchwcOptimizerTests.ConvNchwcGrouped
[       OK ] NchwcOptimizerTests.ConvNchwcGrouped (39 ms)
[ RUN      ] NchwcOptimizerTests.ConvDepthwise
[       OK ] NchwcOptimizerTests.ConvDepthwise (48 ms)
[ RUN      ] NchwcOptimizerTests.ConvPointwise
[       OK ] NchwcOptimizerTests.ConvPointwise (39 ms)
[ RUN      ] NchwcOptimizerTests.ConvMaxPool
[       OK ] NchwcOptimizerTests.ConvMaxPool (11 ms)
[ RUN      ] NchwcOptimizerTests.ConvMaxPoolDilations
[       OK ] NchwcOptimizerTests.ConvMaxPoolDilations (10 ms)
[ RUN      ] NchwcOptimizerTests.ConvAveragePool
[       OK ] NchwcOptimizerTests.ConvAveragePool (12 ms)
[ RUN      ] NchwcOptimizerTests.ConvGlobalPool
[       OK ] NchwcOptimizerTests.ConvGlobalPool (13 ms)
[ RUN      ] NchwcOptimizerTests.ConvAddFusion
[       OK ] NchwcOptimizerTests.ConvAddFusion (120 ms)
[ RUN      ] NchwcOptimizerTests.ConvNoBiasAddFusion
[       OK ] NchwcOptimizerTests.ConvNoBiasAddFusion (4 ms)
[ RUN      ] NchwcOptimizerTests.FusedConvAddFusion
[       OK ] NchwcOptimizerTests.FusedConvAddFusion (20 ms)
[ RUN      ] NchwcOptimizerTests.ConvBinary
[       OK ] NchwcOptimizerTests.ConvBinary (15 ms)
[ RUN      ] NchwcOptimizerTests.ConvBinaryBroadcast
[       OK ] NchwcOptimizerTests.ConvBinaryBroadcast (9 ms)
[ RUN      ] NchwcOptimizerTests.ConvConcat
[       OK ] NchwcOptimizerTests.ConvConcat (20 ms)
[ RUN      ] NchwcOptimizerTests.ConvReuseWeightsOIHWBiBo
[       OK ] NchwcOptimizerTests.ConvReuseWeightsOIHWBiBo (5 ms)
[ RUN      ] NchwcOptimizerTests.ConvReuseWeightsOIHWBo
[       OK ] NchwcOptimizerTests.ConvReuseWeightsOIHWBo (4 ms)
[ RUN      ] NchwcOptimizerTests.ShapeInferencing
[       OK ] NchwcOptimizerTests.ShapeInferencing (7 ms)
[ RUN      ] NchwcOptimizerTests.ShapeInferencing2
[       OK ] NchwcOptimizerTests.ShapeInferencing2 (5 ms)
[ RUN      ] NchwcOptimizerTests.MixedOutputUsage
[       OK ] NchwcOptimizerTests.MixedOutputUsage (5 ms)
[ RUN      ] NchwcOptimizerTests.TensorAlignment
[       OK ] NchwcOptimizerTests.TensorAlignment (6 ms)
[ RUN      ] NchwcOptimizerTests.IntermediatesAsGraphOutputs
[       OK ] NchwcOptimizerTests.IntermediatesAsGraphOutputs (5 ms)
[ RUN      ] NchwcOptimizerTests.BatchNormalization
[       OK ] NchwcOptimizerTests.BatchNormalization (5 ms)
[ RUN      ] NchwcOptimizerTests.ConvReorderInputNhwc
[       OK ] NchwcOptimizerTests.ConvReorderInputNhwc (25 ms)
[ RUN      ] NchwcOptimizerTests.ConvReorderOutputNhwc
[       OK ] NchwcOptimizerTests.ConvReorderOutputNhwc (5 ms)
[ RUN      ] NchwcOptimizerTests.ConvReorderOutputBoth
[       OK ] NchwcOptimizerTests.ConvReorderOutputBoth (5 ms)
[ RUN      ] NchwcOptimizerTests.ConvReorderOutputCnhw
[       OK ] NchwcOptimizerTests.ConvReorderOutputCnhw (5 ms)
[ RUN      ] NchwcOptimizerTests.UpsampleNearest
[       OK ] NchwcOptimizerTests.UpsampleNearest (162 ms)
[ RUN      ] NchwcOptimizerTests.UpsampleLinear
[       OK ] NchwcOptimizerTests.UpsampleLinear (499 ms)
[ RUN      ] NchwcOptimizerTests.Activation
[       OK ] NchwcOptimizerTests.Activation (16 ms)
[ RUN      ] NchwcOptimizerTests.MaxPoolTypeCheck
[       OK ] NchwcOptimizerTests.MaxPoolTypeCheck (4 ms)
[----------] 31 tests from NchwcOptimizerTests (1334 ms total)

[----------] Global test environment tear-down
[==========] 31 tests from 1 test suite ran. (1334 ms total)
[  PASSED  ] 31 tests.

@hariharans29
Member

it seems like there are some test failures with this change

Yes, I was able to reproduce them. The fix is in bd07f63; it also fixes label usage that couldn't compile on macOS. Running the tests now, they pass.

Thanks. I take it that the clobbering fix didn't have any tangible perf loss (i.e.) are the numbers pasted in the PR still true ?

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@milpuz01
Contributor Author

Thanks. I take it that the clobbering fix didn't have any tangible perf loss (i.e.) are the numbers pasted in the PR still true ?

Yes, no performance loss. The numbers are the same as pasted in the PR.

@hariharans29
Member

Still has some failures - it seems like ?

…kage

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
@milpuz01
Contributor Author

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Commenter does not have sufficient privileges for PR 27788 in repo microsoft/onnxruntime

@milpuz01
Contributor Author

Still has some failures - it seems like ?

Yes, it is a macOS failure. The code used x18 as a general-purpose register, but that register is reserved by the platform ABI. In b928c7d we removed that usage and now keep the variable that was stored there on the stack instead. Reran on both macOS and Linux; the extra stack spill doesn't have any impact on the reported benchmark results.

I have tried to rerun the pipelines but I do not have sufficient privileges to rerun.
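The x18 fix follows the usual pattern of demoting a register-resident value to a stack slot; an illustrative fragment (hypothetical spill offset and registers, not the actual kernel code):

```asm
// Before: x18 used as scratch. Invalid: x18 is the reserved platform
// register in AAPCS64 on Apple (and Windows) targets.
//     mov     x18, x5
//     mul     x6, x18, x7
// After: keep the value in a stack slot and reload into a free register.
    str     x5, [sp, #216]       // hypothetical spill slot
    ldr     x9, [sp, #216]       // reload when needed
    mul     x6, x9, x7
```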

Comment thread onnxruntime/core/mlas/lib/aarch64/SconvNchwcKernelNeon.S
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@hariharans29 hariharans29 requested a review from Copilot March 24, 2026 17:25
@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Still has some failures - it seems like ?

Yes, it is a macOS failure. The code used x18 as a general-purpose register, but that register is reserved by the platform ABI. In b928c7d we removed that usage and now keep the variable that was stored there on the stack instead. Reran on both macOS and Linux; the extra stack spill doesn't have any impact on the reported benchmark results.

I have tried to rerun the pipelines but I do not have sufficient privileges to rerun.

Thanks for the fixes. I ll re-run the pipelines. I ve also asked Copilot for one more review. Provided all is good, I will merge it, thanks

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



Comment thread onnxruntime/test/mlas/bench/bench_sconv_nchwc.cpp Outdated
Comment thread onnxruntime/test/mlas/bench/bench_sconv_nchwc.cpp Outdated
@milpuz01
Contributor Author

I ve also asked Copilot for one more review. Provided all is good, I will merge it, thanks

Thank you! I have addressed the Copilot reviews. I reran the benchmark after the changes and this PR slightly improved each case by about 0.1x, though I do not think that is a significant change. Before I committed the change, I believe there were no failures in the pipeline.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@hariharans29 hariharans29 enabled auto-merge (squash) March 24, 2026 20:58
@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Comment thread onnxruntime/test/mlas/bench/bench_sconv_nchwc.cpp
@hariharans29 hariharans29 disabled auto-merge March 24, 2026 21:02
@hariharans29 hariharans29 enabled auto-merge (squash) March 25, 2026 17:38
@hariharans29 hariharans29 merged commit e10ea97 into microsoft:main Mar 25, 2026
101 of 103 checks passed
@milpuz01 milpuz01 deleted the mlas_nchwc_conv branch March 26, 2026 10:48