Update/add to qr_ks_vs_whole_k_prefetch pipeline #3485
…oping Gemm0 along n0 dimension
…e_k_prefetch pipeline
…n whole_k_prefetch path)
…n whole_k_prefetch path in trload pipeline)
… next iteration in the non-whole-k-perfetch path
Pull request overview
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where work-group counts are low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles.
Changes:
- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to reduce shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate clear_tile calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350
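The occupancy argument in the overview (kM0=64 vs 128 when work-group counts are low) can be made concrete with a little arithmetic. The sketch below is not taken from the PR: `num_workgroups` is a hypothetical helper, and it assumes one work-group per (batch, head, M-tile), which is a common launch layout for forward attention but is not confirmed by the source. The problem sizes are illustrative only.

```cpp
#include <cstdint>

// Hypothetical helper (not part of ck_tile): number of work-groups launched
// when each work-group processes kM0 query rows of one (batch, head) pair.
constexpr std::int64_t num_workgroups(std::int64_t batch,
                                      std::int64_t nhead,
                                      std::int64_t seqlen_q,
                                      std::int64_t kM0)
{
    // Ceiling division: a partial last tile still needs its own work-group.
    const std::int64_t m_tiles = (seqlen_q + kM0 - 1) / kM0;
    return batch * nhead * m_tiles;
}

// Illustrative sizes (assumptions): batch=1, nhead=8, seqlen_q=2048.
// Halving kM0 from 128 to 64 doubles the number of work-groups.
static_assert(num_workgroups(1, 8, 2048, 128) == 128, "kM0=128 gives 128 work-groups");
static_assert(num_workgroups(1, 8, 2048, 64) == 256, "kM0=64 gives 256 work-groups");
```

With a small batch and few heads, the kM0=128 launch may leave compute units idle on a device that has more CUs than work-groups, whereas the kM0=64 launch produces twice as many work-groups to distribute.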
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| block_gemm_areg_bsmem_trload_creg_v2_prefetch_n.hpp | New GEMM block implementation supporting transposed V-loading with N-dimension prefetching |
| block_gemm_areg_bsmem_creg_v2_prefetch_n.hpp | N-dimension prefetching GEMM implementation for standard (non-transposed) loading |
| block_gemm_areg_bsmem_creg_v2_prefetch_k.hpp | K-dimension prefetching GEMM implementation |
| tile_fmha_shape.hpp | Adds kN0Sub field and relaxes static assertion for N0-loop support |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_trload.hpp | New pipeline variant with transposed V-loading |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_default_policy.hpp | Comprehensive policy updates for LDS management, alignment, and MFMA selection |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp | Core pipeline updated with N0-loop implementation and simplified memory management |
| block_fmha_pipeline_problem.hpp | Adds utility functions for calculating optimal vector sizes |
| fmha_fwd_kernel.hpp | Kernel updates to support N0-loop pipelines and naive hdim loading |
| fmha.hpp | Includes new trload pipeline header |
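The table mentions utility functions for calculating optimal vector sizes, and the PR adds hdim96/hdim160 support without padding. As a rough illustration of what such a calculation might look like, the sketch below picks the widest power-of-two vector that evenly divides a contiguous row. The function name `max_vector_elems`, its signature, and the 16-byte per-lane access limit are all assumptions for illustration; the actual utilities in block_fmha_pipeline_problem.hpp may work differently.

```cpp
#include <cstddef>

// Hypothetical helper (not the ck_tile API): largest power-of-two number of
// elements per vector access such that
//   * one access does not exceed max_access_bytes (assumed 16 bytes), and
//   * the contiguous row length is a whole multiple of the vector width.
// Assumes max_access_bytes / elem_bytes is itself a power of two.
constexpr std::size_t max_vector_elems(std::size_t contiguous_len,
                                       std::size_t elem_bytes,
                                       std::size_t max_access_bytes = 16)
{
    std::size_t n = max_access_bytes / elem_bytes;
    // Halve the width until it divides the row length evenly.
    while (n > 1 && contiguous_len % n != 0)
        n /= 2;
    return n == 0 ? 1 : n; // element wider than the access limit: scalar access
}

// With 2-byte elements (fp16/bf16), rows of 96 and 160 elements still admit
// 8-element vectors, so no padding to 128 or 256 is needed for alignment.
static_assert(max_vector_elems(96, 2) == 8, "hdim96 keeps the full vector width");
static_assert(max_vector_elems(160, 2) == 8, "hdim160 keeps the full vector width");
```

A row length such as 100 would fall back to 4-element vectors under the same rule, which is why the divisibility check matters for head dimensions that are not powers of two.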
include/ck_tile/ops/gemm/block/block_gemm_areg_bsmem_creg_v2_prefetch_n.hpp
include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp
include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp
include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_problem.hpp
We found async can beat wholek with a new config, will discuss with qianfeng
Error importing due to merge conflicts – please reopen the PR on ROCm/rocm-libraries
About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where work-group counts are low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles.
Changes:
- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to reduce shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate clear_tile calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350
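To illustrate the difference in loop order behind the N0-loop change, here is a toy scalar sketch. It is one plausible reading of the description, not the PR's implementation: it does not model MFMA, LDS, or ck_tile tile windows, and every name in it (`gemm0_k_loop`, `gemm0_n0_loop`, the toy sizes, `kK0`) is hypothetical apart from `kN0Sub`, which the PR adds to tile_fmha_shape.hpp. The K-loop version zeroes an accumulator and adds partial products per K chunk; the N0-loop version assumes the whole K extent is already available, so each kN0Sub-wide slice of the output is computed in full and written once.

```cpp
#include <cstddef>

// Toy sizes (assumptions for illustration only).
constexpr std::size_t kM = 2, kN0 = 4, kK = 4;
constexpr std::size_t kN0Sub = 2; // width of one N0 sub-tile
constexpr std::size_t kK0 = 2;    // K chunk size used by the K-loop variant

struct QMat { int v[kM][kK]; };   // queries
struct KMat { int v[kN0][kK]; };  // keys, row n holds key n
struct SMat { int v[kM][kN0]; };  // scores S = Q * K^T

// K-loop: clear the accumulator once, then accumulate over K chunks.
constexpr SMat gemm0_k_loop(const QMat& q, const KMat& k)
{
    SMat s{}; // zero-init plays the role of clearing the accumulator tile
    for (std::size_t k0 = 0; k0 < kK; k0 += kK0)
        for (std::size_t m = 0; m < kM; ++m)
            for (std::size_t n = 0; n < kN0; ++n)
                for (std::size_t kk = k0; kk < k0 + kK0; ++kk)
                    s.v[m][n] += q.v[m][kk] * k.v[n][kk];
    return s;
}

// N0-loop: walk the output in kN0Sub-wide slices; each element is reduced
// over the full K extent and then stored exactly once.
constexpr SMat gemm0_n0_loop(const QMat& q, const KMat& k)
{
    SMat s{}; // initialised only because constexpr requires it; every element is overwritten
    for (std::size_t n0 = 0; n0 < kN0; n0 += kN0Sub)
        for (std::size_t m = 0; m < kM; ++m)
            for (std::size_t n = n0; n < n0 + kN0Sub; ++n)
            {
                int dot = 0;
                for (std::size_t kk = 0; kk < kK; ++kk)
                    dot += q.v[m][kk] * k.v[n][kk];
                s.v[m][n] = dot; // single store, no read-modify-write
            }
    return s;
}

constexpr bool same(const SMat& a, const SMat& b)
{
    for (std::size_t m = 0; m < kM; ++m)
        for (std::size_t n = 0; n < kN0; ++n)
            if (a.v[m][n] != b.v[m][n])
                return false;
    return true;
}

constexpr QMat kQ = {{{1, 2, 3, 4}, {5, 6, 7, 8}}};
constexpr KMat kKeys = {{{1, 0, 0, 0}, {0, 1, 0, 0}, {1, 1, 1, 1}, {2, 0, 2, 0}}};

// Both loop orders produce the same scores.
static_assert(same(gemm0_k_loop(kQ, kKeys), gemm0_n0_loop(kQ, kKeys)),
              "loop order does not change the result");
```

The two variants differ only in traversal order and in whether the output is accumulated into or written once, which is the aspect the sketch is meant to show.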
Performance results
- qr_ks_vs_async_whole_k_prefetch_trload shows much better performance than qr_ks_vs_async_trload on the same case (execution time 41.02 ms by whole_k_prefetch_trload & 58.50 ms by async_load), and qr_ks_vs_async_whole_k_prefetch_trload also shows clearly better performance than the recently tuned qr_ks_vs_async on the same case (execution time 41.02 ms by whole_k_prefetch_trload, 742.96 ms by qr_ks_vs_async)
- qr_ks_vs_async_whole_k_prefetch shows much better performance than qr_ks_vs_async (which is supposed to be highly efficient) on the same case (execution time 78.02 ms by whole_k_prefetch & 100.20 ms by qr_ks_vs_async)
- qr_ks_vs_async_whole_k_prefetch_trload shows slightly better performance than qr_ks_vs_async on MI350 (execution time 104.50 ms by whole_k_prefetch_trload & 106.50 ms by qr_ks_vs_async), and the two show on-par performance on MI300
Test/Verify
test_whole_k_prefetch_n0loop to test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline cannot be used by the ck_tile fmha example so far
Discussion
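As a reading aid for the timings quoted under Performance results, the pairs reported there can be expressed as speedup ratios (baseline time divided by whole_k_prefetch time). The helper `speedup` below is hypothetical and the sketch only restates the arithmetic on the figures given in the description; it makes no claim about how the measurements were taken.

```cpp
// Hypothetical helper: how many times faster the candidate is than the baseline,
// given execution times in the same unit.
constexpr double speedup(double baseline_ms, double candidate_ms)
{
    return baseline_ms / candidate_ms;
}

// Timings as reported in the PR description (milliseconds).
constexpr double kTrloadVsAsyncTrload = speedup(58.50, 41.02);   // whole_k_prefetch_trload vs async trload
constexpr double kWholeKVsAsync       = speedup(100.20, 78.02);  // whole_k_prefetch vs qr_ks_vs_async
constexpr double kTrloadVsAsyncMI350  = speedup(106.50, 104.50); // whole_k_prefetch_trload vs qr_ks_vs_async

static_assert(kTrloadVsAsyncTrload > 1.42 && kTrloadVsAsyncTrload < 1.43, "about 1.43x");
static_assert(kWholeKVsAsync > 1.28 && kWholeKVsAsync < 1.29, "about 1.28x");
static_assert(kTrloadVsAsyncMI350 > 1.01 && kTrloadVsAsyncMI350 < 1.02, "about 1.02x");
```

In other words, the first two comparisons correspond to roughly 1.43x and 1.28x, while the third is within about 2%.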