Add FLASH_ATTN_HDIMS option to limit kernel compilation by Caellian · Pull Request #2029 · OpenNMT/CTranslate2

Caellian · 2026-04-03T23:10:52Z

In many applications, model head dimensions are known in advance (often regardless of model choice), so it's possible to opt out of compiling kernels for the ones that will never be used.

In my case, I need CTranslate2 only for Whisper models, which means I can cut compile times considerably by setting FLASH_ATTN_HDIMS="64". Newer LLMs also almost always use 128.

The default is backwards compatible; the option can be set explicitly to speed up builds.
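To illustrate the idea, here is a minimal Python sketch of the kind of filtering such an option implies. It is not the PR's build logic. The file names, the separator handling, and the `select_kernels` helper are all hypothetical, and it assumes an empty value means "compile everything", in line with the backwards-compatible default described above.

```python
import re

# Hypothetical kernel source names, modeled on a one-file-per-head-dimension
# layout; the actual file list touched by the PR may differ.
KERNEL_SOURCES = [
    f"flash_fwd_hdim{h}_{dtype}_sm80.cu"
    for h in (32, 64, 96, 128, 160, 192, 224, 256)
    for dtype in ("fp16", "bf16")
]


def select_kernels(sources, hdims):
    """Keep only the sources whose head dimension is listed in `hdims`.

    `hdims` mimics the option value, e.g. "64" or "64;128" (the separator
    is an assumption). An empty value keeps every source, which is assumed
    to correspond to the backwards-compatible default.
    """
    wanted = {int(tok) for tok in re.split(r"[;,\s]+", hdims) if tok}
    if not wanted:
        return list(sources)
    selected = []
    for name in sources:
        match = re.search(r"hdim(\d+)", name)
        if match and int(match.group(1)) in wanted:
            selected.append(name)
    return selected


if __name__ == "__main__":
    for value in ("", "64", "64;128"):
        kept = select_kernels(KERNEL_SOURCES, value)
        print(f"FLASH_ATTN_HDIMS={value!r}: {len(kept)} of {len(KERNEL_SOURCES)} sources")
```

With this hypothetical list, a Whisper-only build that sets "64" keeps 2 of the 16 sources, which is where the compile-time saving would come from.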

In many applications model head dimensions are known in advance and it's possible to opt-out of compiling ones that will never be used, even regardless of model choice.

Signed-off-by: Tin Švagelj <tin.svagelj@live.com>

Add FLASH_ATTN_HDIMS option to limit kernel compilation

5866e41


Caellian force-pushed the master branch from 4703728 to 5866e41 on April 4, 2026 05:18

Caellian wants to merge 1 commit into OpenNMT:master from Caellian:master
