mlas/arm64: Add AArch64 assembly path for NCHWc float kernel and wire into build #27788
hariharans29 merged 6 commits into microsoft:main
Conversation
… into build
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Pull request overview
This PR adds an AArch64 NEON assembly implementation for the MLAS direct NCHWc float convolution kernel and wires it into the existing C++ entrypoint/build so the optimized path is selected transparently on supported ARM64 non-Windows builds.
Changes:
- Route `MlasConvNchwcFloatKernelNeon` to a new `MlasConvNchwcFloatKernelNeonAsm` implementation on ARM64 + NCHWc + non-Windows.
- Introduce a new assembly kernel (`SconvNchwcKernelNeon.S`) implementing left/middle/right regions and multi-output tiling (4/3/2/1).
- Wire the new assembly source into `setup_arm_neon_nchwc()` and expose the new symbol via `mlasi.h`.
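As a rough illustration of this routing (the stand-in functions and the string return values below are hypothetical; the real MLAS entrypoints take the full NCHWc convolution parameter list), the dispatch amounts to a compile-time forward on ARM64 non-Windows builds:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the two kernels. The real symbols
// (MlasConvNchwcFloatKernelNeon / MlasConvNchwcFloatKernelNeonAsm) take the
// NCHWc convolution parameter list, elided here for brevity.
static const char* ConvKernelCpp() { return "cpp"; }
static const char* ConvKernelAsm() { return "asm"; }

// Sketch of the dispatch: on ARM64 non-Windows builds the stable C++
// entrypoint forwards to the assembly implementation; elsewhere it keeps
// the portable C++ path, so existing call sites are unchanged.
const char* MlasConvNchwcDispatch() {
#if defined(__aarch64__) && !defined(_WIN32)
    return ConvKernelAsm();
#else
    return ConvKernelCpp();
#endif
}
```

Because the selection happens inside the existing entrypoint, callers never need to know which implementation they got.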
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Dispatches the stable NCHWc C++ entrypoint to the new asm kernel on supported builds. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares the new asm kernel symbol for non-Windows builds. |
| onnxruntime/core/mlas/lib/aarch64/SconvNchwcKernelNeon.S | Adds the AArch64 NEON assembly micro-kernel implementation for NCHWc float conv. |
| cmake/onnxruntime_mlas.cmake | Adds the new assembly file to the ARM NEON NCHWc build sources (non-Windows). |
Thanks - can you please resolve the Copilot comments?

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Done. The fixes are in 5e0dc2d.
Yes, "4-output center tile" here means 4 spatial output positions.

For filter count, this kernel still iterates filter blocks in the outer loop.

So unlike the NCHW kernel's 2-output/filter-block pairing, this path is not "2 filter counts at once"; it is 4 spatial outputs per step, with filter blocks processed one-by-one in the existing loop.
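A minimal sketch of that tiling (the helper below is hypothetical, not the actual kernel code): a row of output positions is consumed greedily in 4-wide center tiles, leaving at most one 3/2/1 remainder tile, while filter blocks stay in the enclosing loop:

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration of the 4/3/2/1 multi-output tiling described
// above: consume output positions in tiles of 4, then emit a single
// remainder tile of 3, 2 or 1 positions. Filter blocks would be iterated
// one at a time by an outer loop around a routine like this.
std::vector<int> TileOutputRow(int output_count) {
    std::vector<int> tiles;
    while (output_count >= 4) {
        tiles.push_back(4);             // 4-output center tile
        output_count -= 4;
    }
    if (output_count > 0) {
        tiles.push_back(output_count);  // remainder tile (3, 2 or 1)
    }
    return tiles;
}
```

For example, a row of 11 outputs would be processed as two 4-output tiles followed by one 3-output tile.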
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
- Fix SconvNchwcKernelNeon.S KernelFlags==0 remainder path temp spill slot (sp+#120 -> sp+#248) to avoid clobbering callee-saved SIMD spills (q8-q15)
- Replace \@-based local labels with numeric local labels (90f/91b style) for portable assembly parsing (including macOS toolchains)

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
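To see why moving the slot matters, here is a hypothetical model of the overlap (only the #120 and #248 offsets come from the commit; the base offset of the q8-q15 spill region below is an assumption for illustration, not taken from the actual kernel):

```cpp
#include <cassert>

// Hypothetical model of the stack-frame clobber the commit fixes. Assume
// the callee-saved q8-q15 spills occupy [sp+64, sp+192): 8 registers of
// 16 bytes each (the base offset 64 is an illustrative assumption).
constexpr int kCalleeSpillBase = 64;
constexpr int kCalleeSpillEnd  = kCalleeSpillBase + 8 * 16;  // sp+192

// True if a 16-byte temp spill at `offset` overlaps the callee-saved region.
constexpr bool TempClobbersCalleeSpills(int offset) {
    return offset < kCalleeSpillEnd && offset + 16 > kCalleeSpillBase;
}
```

Under this layout a temp at sp+#120 lands inside the callee-saved region (corrupting q8-q15 on return), while sp+#248 sits safely past it.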
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Yes, I was able to reproduce them. The fix is in bd07f63; it also fixes failures from labels that couldn't compile on macOS. Running the tests now, they pass:
Thanks. I take it that the clobbering fix didn't have any tangible perf loss, i.e. are the numbers pasted in the PR still true?
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Yes, no performance loss. The numbers are the same as pasted in the PR.

It seems like there are still some failures?
…kage Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Commenter does not have sufficient privileges for PR 27788 in repo microsoft/onnxruntime
Yes, it is a macOS failure in the code we used. I have tried to rerun the pipelines, but I do not have sufficient privileges to rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Thanks for the fixes. I'll re-run the pipelines. I've also asked Copilot for one more review. Provided all is good, I will merge it, thanks.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Thank you! I have addressed the Copilot reviews. I reran the benchmark after the changes, and this PR slightly improved each case by about 0.1x, but I do not think that is a significant change. Before I committed the change, I believe there were no failures in the pipeline.

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description
This change introduces a new AArch64 assembly implementation for the NCHWc float convolution path and keeps the existing C++ entrypoint stable, with build wiring so the same kernel entry is used transparently by existing call sites.
The optimization strategy is centered on three ideas:
Motivation and Context
This improves throughput while preserving existing behavior and integration points, and addresses comment #27099 (comment) from that PR (cc: @Rohanjames1997 and @hariharans29).
Per-convolution improvements, per core count, for convolutions in the model from #25580 (comment) when running on AWS Graviton 4 (all of these convolutions share N = 1, KH = 3, KW = 3, DH = 1, DW = 1 and G = 1):
And the end-to-end performance improvement when running the model at different core counts:
Running the `bench_sconv_nchwc.cpp` benchmark on an AWS Graviton 4 `c8g.16xlarge`, the following is obtained:

Running the same benchmark on the same system without this PR we get:
which shows the following speed up:
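For clarity on how the speed-up figures are derived (the helper and the numbers in its usage are hypothetical, for illustration only), each speed-up is simply the ratio of the baseline time to the time with this PR for the same case:

```cpp
#include <cassert>

// Hypothetical helper: the speed-up reported for each benchmark case is
// baseline_time / time_with_pr, so values above 1.0 mean the PR is faster.
constexpr double SpeedUp(double baseline_time, double time_with_pr) {
    return baseline_time / time_with_pr;
}
```

For example, a case that took 200 units before the PR and 100 units after would be reported as a 2.0x speed-up.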