Conversation

@carlushuang
Collaborator

This PR adopts the idea from #22 and applies it to the unified fmha pipeline, using a 32x32x16 mfma for the 1st/2nd gemm.

Performance:
4K seqlen, 128 hdim: 131 TFLOPS
8K seqlen, 128 hdim: 135 TFLOPS
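
The PR text does not show the kernel itself, so the following is only a rough, hypothetical sketch of what a 32x32x16 f16 warp-level gemm looks like on AMD CDNA hardware. It assumes the 32x32x16 tile is issued as two chained K=8 slices of the native `__builtin_amdgcn_mfma_f32_32x32x8f16` instruction (available on gfx90a-class GPUs); the actual pipeline in this repo may instead use a different composition or a native 32x32x16 instruction on newer targets, and the function name here is invented for illustration.

```cpp
#include <hip/hip_runtime.h>

// Per-lane fragment types for a CDNA mfma:
// 4 half values of A/B input, 16 float accumulators.
using half4   = __attribute__((ext_vector_type(4))) _Float16;
using float16 = __attribute__((ext_vector_type(16))) float;

// Hypothetical sketch: each wavefront owns a 32x32 accumulator tile
// and advances K by 16 using two K=8 mfma issues.
__device__ float16 warp_gemm_32x32x16(half4 a0, half4 a1,
                                      half4 b0, half4 b1,
                                      float16 acc)
{
    // First K=8 slice; the trailing immediates (cbsz, abid, blgp) are 0.
    acc = __builtin_amdgcn_mfma_f32_32x32x8f16(a0, b0, acc, 0, 0, 0);
    // Second K=8 slice accumulates on top of the first.
    acc = __builtin_amdgcn_mfma_f32_32x32x8f16(a1, b1, acc, 0, 0, 0);
    return acc;
}
```

Relative to a 16x16-based tile, the larger 32x32 accumulator footprint amortizes more of the A/B operand traffic per mfma issue, which is the usual motivation for this instruction shape in attention gemms.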

@asroy asroy merged commit e71aa1d into main Nov 3, 2023
@carlushuang carlushuang deleted the q_persistent_unify branch November 3, 2023 09:11