Update/add to qr_ks_vs_whole_k_prefetch pipeline by qianfengz · Pull Request #3485 · ROCm/composable_kernel

qianfengz · 2025-12-24T10:00:35Z

About qr_ks_vs_whole_k_prefetch pipeline

This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where work-group counts are low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles.

Changes:

Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to reduce shuffle instructions on MI350
Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate clear_tile calls
Adds full support for hdim96/hdim160 without padding requirements
Updates MFMA instruction selection to ensure optimal choices for MI350
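
The occupancy argument above can be illustrated with a rough sketch. It assumes that each work-group processes one kM0-row tile of Q for one (batch, head) pair, and that the device has 256 CUs. The grid mapping, the CU count, and the example shape are illustrative assumptions, not values taken from this PR.

```python
import math


def workgroup_count(batch: int, heads: int, seqlen_q: int, km0: int) -> int:
    """Work-groups launched if each one handles a kM0-row tile of Q (assumed mapping)."""
    return batch * heads * math.ceil(seqlen_q / km0)


NUM_CUS = 256  # assumed CU count, for illustration only

# Hypothetical attention shape with few work-groups.
for km0 in (128, 64):
    wgs = workgroup_count(batch=1, heads=8, seqlen_q=2048, km0=km0)
    print(f"kM0={km0}: {wgs} work-groups, {wgs / NUM_CUS:.2f} per CU")
```

Under these assumptions, halving kM0 doubles the number of work-groups, which is the effect the pipeline relies on when the work-group count is low.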

Performance results

For attention shapes that lead to kM0=64, qr_ks_vs_async_whole_k_prefetch_trload performs much better than qr_ks_vs_async_trload on the same case (execution time 41.02 ms for whole_k_prefetch_trload vs. 58.50 ms for async_load), and it also performs better than the recently tuned qr_ks_vs_async on the same case (execution time 41.02 ms for whole_k_prefetch_trload vs. 42.96 ms for qr_ks_vs_async).
On MI300, for attention shapes that lead to kM0=64, qr_ks_vs_async_whole_k_prefetch also performs much better than qr_ks_vs_async (which is supposed to be highly efficient) on the same case (execution time 78.02 ms for whole_k_prefetch vs. 100.20 ms for qr_ks_vs_async).
For attention shapes that lead to kM0=128, qr_ks_vs_async_whole_k_prefetch_trload performs slightly better than qr_ks_vs_async on MI350 (execution time 104.50 ms for whole_k_prefetch_trload vs. 106.50 ms for qr_ks_vs_async), and they are on par on MI300.
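
To put the reported timings on a common scale, the following sketch converts each pair into a speedup ratio (baseline time divided by new time). The timings are the ones quoted above; the labels and the helper name `speedup` are only for illustration.

```python
def speedup(new_ms: float, baseline_ms: float) -> float:
    """Ratio of baseline time to new time; values above 1.0 mean the new pipeline is faster."""
    return baseline_ms / new_ms


# (label, whole_k_prefetch time in ms, baseline time in ms), as reported in this PR
cases = [
    ("kM0=64, trload vs qr_ks_vs_async_trload", 41.02, 58.50),
    ("kM0=64, trload vs qr_ks_vs_async", 41.02, 42.96),
    ("MI300, kM0=64, vs qr_ks_vs_async", 78.02, 100.20),
    ("MI350, kM0=128, trload vs qr_ks_vs_async", 104.50, 106.50),
]

for label, new_ms, baseline_ms in cases:
    print(f"{label}: {speedup(new_ms, baseline_ms):.2f}x")
```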

Test/Verify

Use the ROCm xformers branch test_whole_k_prefetch_n0loop to test and verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline cannot yet be used by the ck_tile fmha example.
Use the following commands to build and test xformers:

#> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
#> cd xformers
#> git submodule update --init --recursive
#> pip install --no-build-isolation -e ./
#> pytest tests/test_mem_eff_attention.py::test_forward

Any script that runs on xformers can be used to evaluate the qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment variables to switch between pipelines:

#> export FMHA_DISABLE_SPECIAL_TREATMENT=1   # disable FAV3 and the qr_ks_vs_async_trload pipeline
#> export FMHA_ENABLE_ASYNC_PIPELINE=1       # disable the qr_ks_vs_async pipeline, for comparison
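
For scripted comparisons, the two variables can be set per run instead of exported in the shell. The sketch below is one possible way to do that; the helper names `pipeline_env` and `run_forward_test` are hypothetical, and it assumes that leaving a variable unset restores the default pipeline selection, which this PR does not state.

```python
import os
import subprocess


def pipeline_env(disable_special_treatment: bool, enable_async_pipeline: bool, base=None) -> dict:
    """Build an environment with the two FMHA switches set to "1" or left unset."""
    env = dict(os.environ if base is None else base)
    flags = {
        "FMHA_DISABLE_SPECIAL_TREATMENT": disable_special_treatment,
        "FMHA_ENABLE_ASYNC_PIPELINE": enable_async_pipeline,
    }
    for name, is_set in flags.items():
        if is_set:
            env[name] = "1"
        else:
            env.pop(name, None)  # assumption: unset means default behaviour
    return env


def run_forward_test(env: dict) -> int:
    """Run the forward test from the xformers checkout with the given environment."""
    cmd = ["pytest", "tests/test_mem_eff_attention.py::test_forward"]
    return subprocess.run(cmd, env=env, check=False).returncode
```

Calling `run_forward_test(pipeline_env(True, True))` from the xformers checkout would then correspond to the two exports shown above.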

Discussion

Copilot

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| block_gemm_areg_bsmem_trload_creg_v2_prefetch_n.hpp | New GEMM block implementation supporting transposed V-loading with N-dimension prefetching |
| block_gemm_areg_bsmem_creg_v2_prefetch_n.hpp | N-dimension prefetching GEMM implementation for standard (non-transposed) loading |
| block_gemm_areg_bsmem_creg_v2_prefetch_k.hpp | K-dimension prefetching GEMM implementation |
| tile_fmha_shape.hpp | Adds kN0Sub field and relaxes static assertion for N0-loop support |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_trload.hpp | New pipeline variant with transposed V-loading |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_default_policy.hpp | Comprehensive policy updates for LDS management, alignment, and MFMA selection |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp | Core pipeline updated with N0-loop implementation and simplified memory management |
| block_fmha_pipeline_problem.hpp | Adds utility functions for calculating optimal vector sizes |
| fmha_fwd_kernel.hpp | Kernel updates to support N0-loop pipelines and naive hdim loading |
| fmha.hpp | Includes new trload pipeline header |

include/ck_tile/ops/gemm/block/block_gemm_areg_bsmem_creg_v2_prefetch_n.hpp

include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp

include/ck_tile/ops/fmha/pipeline/tile_fmha_shape.hpp

include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp

include/ck_tile/ops/fmha/kernel/fmha_fwd_kernel.hpp

include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_problem.hpp

asleepzzz · 2026-01-28T08:26:58Z

We found that async can beat wholek with a new config; will discuss with qianfeng.

ammallya · 2026-02-03T22:06:30Z

Error importing due to merge conflicts – please reopen the PR on ROCm/rocm-libraries

qianfengz added 25 commits December 4, 2025 09:09

Initial re-implementation of pipeline qr_ks_vs_whole_k_prefetch in looping Gemm0 along n0 dimension

5fada1c

Add prefetching whole next iteration K path in the pipeline

98f9b4a

Change in GetKVBlockGemm to let gemm1 to use WarpTile-16x16x16/32x32x8 on mi350

c32949b

Switch the codes based on the iteration index (first/intermediate/last)

25521a7

Simplify the block_gemm codes

8b85919

[Performance] Change __builtin_amdgcn_sched_barrier() in block_gemm

5722f8a

Refine the interleaving in the loop of Gemm0

044f554

Using explicit vgpr-saved partition_index with store_tile(lds_window, ...)

2ea8d83

Separate kN0Sub from kK0 to be used for flexible tile tuning for whole_k_prefetch pipeline

12c8873

Load Q through Lds

c3d3487

Fix move_tile_window(k_dram_window, ..) step in the pipeline

409ec3b

Remove replicated codes in the pipeline

370d386

Adjust in GetNumPrefetchV()

d281c51

Add support of loading QK tiles of hdim96 without padding to hdim128

384f470

Add qr_ks_vs_whole_k_prefetch_trload pipeline

eb598a9

Using is_using_trload_v to check the kUseTrLoad from pipeline

57abd10

Load Q directly from global memory to registers for BlockGemm

3f6d26e

Fix the static_assert expression in the pipeline

1ef76a6

Update to the non-whole-k-prefetch path in the whoke_k_prefetch pipeline

db5c12d

Update to only pre-load one v_tile during Gemm0 loop

57cf989

Move the loading of k_file for next iteration into the Gemm1 loop (non whole_k_prefetch path)

b77fdbf

Update to GetNumPrefetchV()

e7e6ebc

Move the loading of k_tile for next iteration into the Gemm1 loop (non whole_k_prefetch path in trload pipeline)

6c91b0c

Update to GetNumPrefetchV() for kM0=64 path

489e255

Update in whole_k_prefetch_trload pipeline to prefetch two k_tile for next iteration in the non-whole-k-perfetch path

f5b4d5d

qianfengz requested review from aosewski, carlushuang, geyyer, illsilin and poyenc as code owners December 24, 2025 10:00

qianfengz requested review from ThomasNing, afagaj, andriy-ca, asleepzzz, bartekxk, cgmillette, coderfeli, shumway, tenpercent and vidyasagar-amd as code owners December 24, 2025 10:00

poyenc requested a review from Copilot January 15, 2026 06:40

poyenc assigned qianfengz Jan 15, 2026

Copilot started reviewing on behalf of poyenc January 15, 2026 06:41

Copilot AI reviewed Jan 15, 2026

illsilin assigned asleepzzz Jan 27, 2026

Remove un-used index constant definition

eba924d

qianfengz requested review from Snektron and vpietila-amd as code owners February 14, 2026 08:18

qianfengz added 4 commits February 24, 2026 07:37

Fix sched_barrier mask value

62cf370

Fix in comments

4d83c92

Remove some not very much required interfaces from pipeline problem

b78a240

Merge branch 'develop' into whole_k_prefetch_n0loop

a0ad368

asleepzzz approved these changes Mar 17, 2026

asleepzzz enabled auto-merge (squash) March 17, 2026 03:23

Merge branch 'develop' into whole_k_prefetch_n0loop

1a8d58a

qianfengz disabled auto-merge March 17, 2026 03:38

qianfengz enabled auto-merge (squash) March 17, 2026 03:38

qianfengz disabled auto-merge March 17, 2026 03:43

qianfengz wants to merge 31 commits into develop from whole_k_prefetch_n0loop

4 participants