Add FLASH_ATTN_HDIMS option to limit kernel compilation #2029

Open
Caellian wants to merge 1 commit into OpenNMT:master from Caellian:master
Caellian commented Apr 3, 2026

In many applications the model's head dimensions are known in advance, sometimes regardless of which model is chosen, so it is possible to opt out of compiling kernels for head dimensions that will never be used.

In my case, I need CTranslate2 only for Whisper models, so I can cut compile times considerably by setting the FLASH_ATTN_HDIMS="64" option. Newer LLMs also almost always use 128.

The default is backwards compatible; the option can be set explicitly to speed up builds.
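To illustrate the idea, here is a minimal Python sketch of the selection logic described above. It is not the code from this PR, which lives in the build system. The list of available head dimensions, the accepted separators (`;`, `,` or whitespace), the `_hdim<N>_` file-name pattern, and the helper names `parse_hdims` and `select_kernel_sources` are all assumptions made for the example.

```python
import re

# Head dimensions assumed to have dedicated kernels (hypothetical default list).
ALL_HDIMS = (32, 64, 96, 128, 160, 192, 224, 256)


def parse_hdims(value, default=ALL_HDIMS):
    """Parse a value such as "64" or "64;128" into a sorted list of ints.

    An empty value falls back to the default, which keeps every kernel
    (the backwards-compatible behaviour).
    """
    tokens = [t for t in re.split(r"[;,\s]+", value.strip()) if t]
    if not tokens:
        return sorted(default)
    hdims = sorted({int(t) for t in tokens})
    unknown = [h for h in hdims if h not in default]
    if unknown:
        raise ValueError(f"unsupported head dimensions: {unknown}")
    return hdims


def select_kernel_sources(sources, hdims):
    """Keep only the sources whose file name encodes a selected head dimension."""
    pattern = re.compile(r"_hdim(\d+)_")
    selected = []
    for path in sources:
        match = pattern.search(path)
        # Files without an hdim tag are not per-dimension kernels; keep them.
        if match is None or int(match.group(1)) in hdims:
            selected.append(path)
    return selected


# Hypothetical file names, used only to exercise the filter.
sources = [
    "flash_api.cu",
    "flash_fwd_hdim64_fp16_sm80.cu",
    "flash_fwd_hdim128_fp16_sm80.cu",
]
print(select_kernel_sources(sources, parse_hdims("64")))
```

With the value "64" only the shared file and the 64-dimension kernel remain in the list, while an empty value keeps every source, which mirrors the backwards-compatible default.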

In many applications model head dimensions are known in advance and it's
possible to opt-out of compiling ones that will never be used, even
regardless of model choice.

Signed-off-by: Tin Švagelj <tin.svagelj@live.com>