
Optimize FlashAttention for M4 Max (20x speedup) #27780

Open

xenova wants to merge 13 commits into microsoft:main from xenova:mha-optimizations

Conversation

@xenova (Contributor) commented Mar 20, 2026

MultiHeadAttention
Before: 58.3s
After: 2.89s
Speedup: 20x

Description

Motivation and Context

Tested with vision_encoder.onnx for https://huggingface.co/onnx-community/LightOnOCR-2-1B-ONNX

MultiHeadAttention
Before: 58.3s
After: 5.4s
Speedup: 10.8x
@xenova xenova marked this pull request as draft March 20, 2026 05:31
@xenova xenova changed the title Optimize FlashAttention for M4 Max (10.8x speedup) Optimize FlashAttention for M4 Max (12x speedup) Mar 20, 2026
@xenova xenova changed the title Optimize FlashAttention for M4 Max (12x speedup) Optimize FlashAttention for M4 Max (20x speedup) Mar 20, 2026
@xenova xenova marked this pull request as ready for review March 20, 2026 06:26
@xenova (Contributor, Author) commented Mar 20, 2026

@guschmue 🙏

@qjia7 (Contributor) commented Mar 20, 2026

Awesome! I made a similar change a few days ago to optimize Whisper locally, but your approach is more comprehensive than mine. I just tested it and observed comparable improvements to those in #27781 for Whisper. I’ll go ahead and close that one, and I’m looking forward to seeing this land soon!


```wgsl
// Private memory per lane.
var<private> q_tile : array<q_value_t, head_size_vec>;
var<private> qk_scores : array<q_element_t, max_k_step>;
```
Contributor:
When max_k_step = 128 (e.g., head_size=32 with f16): this allocates 128 private registers per lane for QK scores. On some GPUs, this may cause register spilling and hurt performance. Have you tested this on less powerful devices, such as Intel Tiger Lake or Qualcomm?

Contributor Author:
> Have you tested this on less powerful devices, such as Intel Tiger Lake or Qualcomm?

Unfortunately not, I've mainly just tested on my device to be honest. Do you have recommendations or a CI that can help with this?

Contributor Author:
Reducing max_k_step to 64 doesn't hurt M4 Max performance (commit 6487515).

Contributor:
I found regressions on Qualcomm for phi4. It seems that register spilling happens (2s -> 11s for FlashAttention). Maybe we should keep the original path for Qualcomm (I haven't tested Tiger Lake yet). Profiles:

[screenshot: baseline]
[screenshot: with this PR]

Contributor Author:
Thanks so much for testing! I will keep the Qualcomm path based on this feedback.

Contributor Author:
Added back the Qualcomm path. @qjia7 can you do another round of testing? The change doesn't affect current performance on M4.

| Run | Op Name | Count | Total | Avg | Min | Max | % Total | Provider(s) |
|---|---|---|---|---|---|---|---|---|
| main | MultiHeadAttention | 168 | 693.627 ms | 4.129 ms | 10.0 us | 8.340 ms | 80.26% | WebGpu |
| this PR (w/o qualcomm path) | MultiHeadAttention | 168 | 77.239 ms | 459.8 us | 10.0 us | 917.0 us | 31.06% | WebGpu |
| this PR (w/ qualcomm path) | MultiHeadAttention | 168 | 77.270 ms | 459.9 us | 10.0 us | 920.0 us | 31.20% | WebGpu |

Contributor:
Verified on Qualcomm. The perf is back. Thanks.

@xenova (Contributor, Author) commented Mar 20, 2026

> Awesome! I made a similar change a few days ago to optimize Whisper locally, but your approach is more comprehensive than mine. I just tested it and observed comparable improvements to those in #27781 for Whisper. I'll go ahead and close that one, and I'm looking forward to seeing this land soon!

Great! 😄 I must admit that the changes I made were heavily optimized for my M4 Max and this specific vision encoder. But as you mentioned above, it does seem to help with Whisper too.

Also, a lot of the PR diff is removing prefer_subgroupshuffle... which may not be good across other devices. @guschmue lmk what you think!

@xenova (Contributor, Author) commented Mar 20, 2026

Ran some more benchmarks on some other models.

| Model | ONNX file | Op | before | after | speedup |
|---|---|---|---|---|---|
| onnx-community/all-MiniLM-L6-v2-ONNX | model.onnx | MultiHeadAttention | 4.36ms | 2.02ms | 2.16x |
| onnx-community/gemma-3-270m-it-ONNX | model.onnx | GroupQueryAttention | 3.65ms | 1.63ms | 2.24x |
| onnx-community/LightOnOCR-2-1B-ONNX | vision_encoder.onnx | MultiHeadAttention | 192s | 9.3s | 20.71x |
| onnx-community/LightOnOCR-2-1B-ONNX | decoder_model_merged.onnx | GroupQueryAttention | 14.3ms | 7.4ms | 1.96x |

@xenova (Contributor, Author) commented Mar 21, 2026

I've been testing more and more... every model sees a 2-3x performance improvement for the MHA nodes. Hoping we can get some benchmarking done on lower-end devices so we can fast-track the PR!

@guschmue (Contributor):

extra cool!

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Mar 23, 2026
@guschmue (Contributor):

I can give it a run on tiger lake in the afternoon.

@xenova (Contributor, Author) commented Mar 23, 2026

> extra cool!

> I can give it a run on tiger lake in the afternoon.

Great! Hopefully we see good performance 🤞

@xenova (Contributor, Author) commented Mar 23, 2026

Based on Guenther's feedback, I updated the implementation so that we only use my optimized branch for Apple hardware. Everything else falls back to the original implementation. I see the performance increase on my device (M4 Max), and other hardware should produce the same benchmarks as before.

@qjia7 @guschmue PTAL 🙏

```cpp
if (max_k_from_shm >= 64) {
  max_k_step_ = 64;
} else if (max_k_from_shm >= 32) {
  max_k_step_ = 32;
```
Contributor:
Your current method uses more registers to improve performance. Have you measured the perf gap between max_k_step_ = 32 and max_k_step_ = 64 on M4 Max? And how does max_k_step_ = 32 plus subgroupShuffle compare with max_k_step_ = 64 on M4 Max? If they get similar performance, I'd prefer we use max_k_step_ = 32 for Apple and NVIDIA, which can help reduce register pressure (such as on M1). My previous machine was NVIDIA and I saw a very good improvement for Whisper with max_k_step_ = 32.

Contributor Author:
Sure, I can test that.

@xenova (Contributor, Author) commented Mar 24, 2026
Okay, max_k_step_ = 32 has no noticeable performance difference vs. 64.

max_k_step_ = 32 plus subgroupShuffle causes significant issues.

Contributor:
Weird that max_k_step_ = 32 plus subgroupShuffle causes significant issues. Thanks for trying. The latest change looks good to me.

Contributor Author:
Yeah, weird that it only happens for Apple. Maybe an upstream implementation issue in Dawn?

@guschmue I think we're good to merge? 😇

@guschmue (Contributor):

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

@sroussey (Contributor):

@xenova might fix the preprocessor directive whitespace issues in case that is holding this up. definitely looking forward to this improvement!

@xenova (Contributor, Author) commented Mar 26, 2026

> @xenova might fix the preprocessor directive whitespace issues in case that is holding this up. definitely looking forward to this improvement!

We're noticing some regressions for older (~M1) apple hardware... so we're still trying to figure out what the optimal setup looks like.

@kokroo commented Apr 5, 2026

> @xenova might fix the preprocessor directive whitespace issues in case that is holding this up. definitely looking forward to this improvement!

> We're noticing some regressions for older (~M1) apple hardware... so we're still trying to figure out what the optimal setup looks like.

+1. The fixes don't really speed anything up on my M1 Pro.

@xenova (Contributor, Author) commented Apr 14, 2026

@qjia7 it would be great to be able to get this working (in such a way that doesn't affect other hardware). Any ideas?

@xenova (Contributor, Author) commented Apr 14, 2026

Noticed again with https://huggingface.co/onnx-community/depth-anything-v2-small-ONNX, where my branch is around 6x faster (460ms -> 75ms).

@qjia7 (Contributor) commented Apr 14, 2026

> @qjia7 it would be great to be able to get this working (in such a way that doesn't affect other hardware). Any ideas?

Could you gather more details from AdapterInfo and further constrain this to M4 Max? If so, we can move forward with landing it. I can also take a deeper look into the regression causes introduced by these changes. Hopefully we can enable this on more devices; I'm seeing significant gains on my NV device as well.

For example, I can retrieve the adapter info as shown below. I expect you should also be able to distinguish M4 Max using the adapter information.

```
vendor="nvidia"
architecture="lovelace"
device="NVIDIA RTX 2000 Ada Generation Laptop GPU"
backend_type=4, vendor_id=4318, device_id=10424
```

@sroussey (Contributor):

I have an M2 Max I can test. What is the quickest way to do so?

@xenova (Contributor, Author) commented Apr 14, 2026

@qjia7 the only useful information I can see is probably architecture: "metal-3" (vendor is apple). Everything else appears blank.

@qjia7 (Contributor) commented Apr 15, 2026

> @qjia7 the only useful information I can see is probably architecture: "metal-3" (vendor is apple). everything else appears blank.

How about providing a WebGPU EP session option, something like

```
ep.webgpuexecutionprovider.experimentalEnableAggressiveFlashAttention = "1"
```

Benefits:

- Safe to land: off by default, zero regression risk on any device
- Device-agnostic: anyone (Apple, NVIDIA, Intel) can opt in and test
- Data-driven follow-up: once we collect enough benchmarks across devices, a future PR can auto-enable it for known-good architectures

Note: I see this as a temporary stepping stone, not a permanent solution. For follow-up work:

1. Root-cause the regressions on unexpected devices. Having the opt-in flag makes it easy to A/B test on affected machines. We can do deeper analysis on why this regresses (register spilling? workgroup size mismatch? shared memory pressure?). There may also still be room to further optimize the current shader.

2. File a Dawn bug for richer GPU info. Currently Dawn's AdapterInfo only exposes architecture: "metal-3" for Apple, which isn't sufficient to distinguish M4 Max from other variants. It would be helpful to ask for more detailed GPU identification. Once Dawn exposes that information, we can follow up with a PR to automatically enable the optimization on the appropriate devices and eventually deprecate the manual opt-in.

What do you think? @xenova @guschmue
