Add splitk hgemm kernel #274

Draft
xytpai wants to merge 3 commits into main from xyt/hgemm_spk
xytpai commented Mar 23, 2026

Motivation

Improve the performance of latency-bound hgemm on slim shapes (small m, as in the test shapes below).

Technical Details

Use split-k + clean policy
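
As a rough illustration of the split-k idea only (not the FlyDSL kernel in this PR), the pure-Python sketch below cuts the k dimension into `split_k` slices, computes a partial product per slice, and sums the partials into a zero-initialized output. The names `matmul_slice` and `splitk_matmul` are hypothetical, the sketch assumes k divides evenly by `split_k`, and it does not model the clean policy mentioned above; it simply zeroes the output before accumulating.

```python
def matmul_slice(a, b, k_begin, k_end):
    """Partial product of a (m x k) and b (k x n) over k in [k_begin, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for kk in range(k_begin, k_end):
            a_ik = a[i][kk]
            for j in range(n):
                out[i][j] += a_ik * b[kk][j]
    return out


def splitk_matmul(a, b, split_k):
    """Split-k GEMM sketch: each slice reduces part of k, partials are summed."""
    m, k, n = len(a), len(b), len(b[0])
    assert k % split_k == 0, "this sketch assumes k divides evenly by split_k"
    chunk = k // split_k
    # The output starts at zero because every slice accumulates into it.
    c = [[0.0] * n for _ in range(m)]
    # On a GPU these slices would run as independent workgroups.
    for s in range(split_k):
        partial = matmul_slice(a, b, s * chunk, (s + 1) * chunk)
        for i in range(m):
            for j in range(n):
                # Stands in for an atomic add or a separate reduction pass.
                c[i][j] += partial[i][j]
    return c
```

For slim shapes there are few output tiles, so slicing k gives the GPU more independent work, at the cost of an extra reduction over the partial results.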

Test Result

Platform: MI308

| m | n | k | tile-config | dtype | Torch-hipblas (us) | FlyDSL (us) | Speedup |
|---|---|---|---|---|---|---|---|
| 32 | 8192 | 2048 | 32x256x128-spk4 | bf16 | 32.7 | 20.2 | 1.6 |
| 32 | 384 | 7168 | 16x128x128-spk8 | bf16 | 12.9 | 8.4 | 1.5 |
| 32 | 1024 | 7168 | 16x128x128-spk8 | bf16 | 19.1 | 12.4 | 1.5 |
| 64 | 1024 | 7168 | 32x128x128-spk8 | bf16 | 23.5 | 15.9 | 1.5 |
| 128 | 1024 | 7168 | 64x128x128-spk8 | bf16 | 35.9 | 23.4 | 1.5 |
| 256 | 1024 | 7168 | 64x256x64-spk4 | bf16 | 47.5 | 41.6 | 1.1 |

Platform: MI355

| m | n | k | tile-config | dtype | Torch-hipblas (us) | FlyDSL (us) | Speedup |
|---|---|---|---|---|---|---|---|
| 32 | 8192 | 2048 | 32x256x128-spk4 | bf16 | 13.2 | 12.5 | 1.1 |
| 32 | 384 | 7168 | 16x128x128-spk8 | bf16 | 10.9 | 7.0 | 1.5 |
| 32 | 1024 | 7168 | 16x128x128-spk8 | bf16 | 12.8 | 10.1 | 1.3 |
| 64 | 1024 | 7168 | 32x128x128-spk8 | bf16 | 13.3 | 10.7 | 1.2 |
| 128 | 1024 | 7168 | 64x128x128-spk8 | bf16 | 18.1 | 14.0 | 1.3 |
| 256 | 1024 | 7168 | 64x128x128-spk8 | bf16 | 17.8 | 16.8 | 1.1 |
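
To make the tile-config column easier to read, here is a small sketch that assumes the format is `<tile_m>x<tile_n>x<tile_k>-spk<split_k>` and that one workgroup handles one (output tile, k slice) pair. Both are assumptions inferred from the names, not confirmed by the PR, and `parse_tile_config` and `workgroup_count` are hypothetical helpers.

```python
import math
import re


def parse_tile_config(cfg):
    """Parse a string such as '16x128x128-spk8' (format assumed, see above)."""
    match = re.fullmatch(r"(\d+)x(\d+)x(\d+)-spk(\d+)", cfg)
    if match is None:
        raise ValueError(f"unrecognized tile config: {cfg!r}")
    tile_m, tile_n, tile_k, split_k = (int(g) for g in match.groups())
    return tile_m, tile_n, tile_k, split_k


def workgroup_count(m, n, cfg):
    """Return (output tiles, workgroups), assuming one workgroup per tile per k slice."""
    tile_m, tile_n, _tile_k, split_k = parse_tile_config(cfg)
    output_tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return output_tiles, output_tiles * split_k


if __name__ == "__main__":
    # m=32, n=384 row from the tables above.
    print(workgroup_count(32, 384, "16x128x128-spk8"))
```

Under those assumptions, the m=32, n=384 case with `16x128x128-spk8` has 6 output tiles and 48 workgroups, which shows how split-k multiplies the amount of parallel work available for a slim shape.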

coderfeli (Collaborator) commented

Benchmark and test results? Also, could you compare large shapes like 8192^3?

xytpai (Author) commented Mar 24, 2026

> Benchmark and test results? Also, could you compare large shapes like 8192^3?

For now, the priority is to cover the Kimi cases and add communication-compute fusion. We can iterate on optimizations in the next step.

xytpai marked this pull request as draft March 25, 2026 06:10