Update/add to qr_ks_vs_whole_k_prefetch pipeline #3485

Open
qianfengz wants to merge 31 commits into develop from whole_k_prefetch_n0loop

@qianfengz qianfengz commented Dec 24, 2025

About qr_ks_vs_whole_k_prefetch pipeline

This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where work-group counts are low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles.

Changes:

  • Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to reduce shuffle instructions on MI350
  • Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate clear_tile calls
  • Adds full support for hdim96/hdim160 without padding requirements
  • Updates MFMA instruction selection to ensure optimal choices for MI350
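The occupancy argument in the overview (a smaller kM0 helps when work-group counts are low) can be illustrated with a little arithmetic. The sketch below is not taken from the PR. It assumes the usual forward-attention launch shape of one work-group per (batch, head, M-tile), and the problem shape and the CU count of 256 are hypothetical values chosen only for illustration.

```python
import math


def num_work_groups(batch, heads, seqlen_q, k_m0):
    """Work-groups launched, ASSUMING one work-group per (batch, head, M-tile)."""
    return batch * heads * math.ceil(seqlen_q / k_m0)


def cu_coverage(work_groups, num_cus):
    """Fraction of CUs that receive at least one work-group (capped at 1.0)."""
    return min(1.0, work_groups / num_cus)


if __name__ == "__main__":
    # Hypothetical problem shape and CU count, for illustration only.
    batch, heads, seqlen_q = 1, 8, 2048
    num_cus = 256

    for k_m0 in (128, 64):
        wgs = num_work_groups(batch, heads, seqlen_q, k_m0)
        print(f"kM0={k_m0}: {wgs} work-groups, "
              f"CU coverage {cu_coverage(wgs, num_cus):.2f}")
```

Under these assumptions, halving kM0 doubles the number of work-groups, so a launch that would leave half of the CUs idle at kM0=128 can cover all of them at kM0=64.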

Performance results

  1. For attention shapes that lead to kM0=64, qr_ks_vs_async_whole_k_prefetch_trload performs much better than qr_ks_vs_async_trload on the same case (execution time 41.02 ms for whole_k_prefetch_trload vs. 58.50 ms for qr_ks_vs_async_trload). It is also faster than the recently tuned qr_ks_vs_async on the same case (41.02 ms for whole_k_prefetch_trload vs. 42.96 ms for qr_ks_vs_async).
  2. On MI300, for attention shapes that lead to kM0=64, qr_ks_vs_async_whole_k_prefetch also performs much better than qr_ks_vs_async (which is supposed to be highly efficient) on the same case (78.02 ms for whole_k_prefetch vs. 100.20 ms for qr_ks_vs_async).
  3. For attention shapes that lead to kM0=128, qr_ks_vs_async_whole_k_prefetch_trload is slightly faster than qr_ks_vs_async on MI350 (104.50 ms for whole_k_prefetch_trload vs. 106.50 ms for qr_ks_vs_async). On MI300 the performance is completely on par.
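To put the timings above on a common scale, the following sketch converts each reported pair into a speedup ratio (baseline time divided by whole_k_prefetch time). The numbers are copied from the list above. The labels are shorthand, and attributing the 58.50 ms figure to qr_ks_vs_async_trload is a reading of the first item rather than something stated separately.

```python
# (label, whole_k_prefetch time in ms, baseline time in ms), as reported above.
REPORTED_TIMINGS_MS = [
    ("kM0=64, vs qr_ks_vs_async_trload", 41.02, 58.50),  # baseline attribution assumed
    ("kM0=64, vs qr_ks_vs_async", 41.02, 42.96),
    ("MI300, kM0=64, vs qr_ks_vs_async", 78.02, 100.20),
    ("MI350, kM0=128, vs qr_ks_vs_async", 104.50, 106.50),
]


def speedup(new_ms, baseline_ms):
    """Speedup of the new pipeline over the baseline (values > 1 mean faster)."""
    return baseline_ms / new_ms


if __name__ == "__main__":
    for label, new_ms, baseline_ms in REPORTED_TIMINGS_MS:
        print(f"{label}: {speedup(new_ms, baseline_ms):.2f}x")
```

The ratios come out to roughly 1.43x, 1.05x, 1.28x, and 1.02x, which matches the qualitative wording of the three items: large gains at kM0=64 and a marginal gain at kM0=128.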

Test/Verify

  1. Use the ROCm xformers branch test_whole_k_prefetch_n0loop to test and verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline cannot yet be used by the ck_tile fmha example.
  2. Use the following commands to build and test xformers:
#> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
#> cd xformers
#> git submodule update --init --recursive
#> pip install --no-build-isolation -e ./
#> pytest tests/test_mem_eff_attention.py::test_forward
  3. Any script that runs on xformers can be used to evaluate the qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment variables to switch between pipelines:
#> export FMHA_DISABLE_SPECIAL_TREATMENT=1    # to disable using FAV3 and the qr_ks_vs_async_trload pipeline
#> export FMHA_ENABLE_ASYNC_PIPELINE=1        # to disable using the qr_ks_vs_async pipeline for comparison
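For repeated comparisons it may be convenient to set these switches per run instead of exporting them in the shell. The helper below is a minimal sketch, not part of the PR: the two variable names come from the description above, while `build_env` and `run_forward_test` are hypothetical names, and treating an unset variable as "switch off" is an assumption about how the xformers branch reads them.

```python
import os
import subprocess

# Variable names as given in the PR description.
PIPELINE_SWITCHES = ("FMHA_DISABLE_SPECIAL_TREATMENT", "FMHA_ENABLE_ASYNC_PIPELINE")


def build_env(switches, base_env=None):
    """Return a copy of base_env with exactly the given switches set to "1".

    Known switches that are not requested are removed from the copy
    (ASSUMPTION: an unset variable means the switch is off).
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in PIPELINE_SWITCHES:
        env.pop(name, None)
    for name in switches:
        if name not in PIPELINE_SWITCHES:
            raise ValueError(f"unknown pipeline switch: {name}")
        env[name] = "1"
    return env


def run_forward_test(switches):
    """Run the forward test from the PR description with the given switches."""
    cmd = ["pytest", "tests/test_mem_eff_attention.py::test_forward"]
    return subprocess.run(cmd, env=build_env(switches), check=False).returncode
```

From the xformers checkout, calling `run_forward_test(PIPELINE_SWITCHES)` would run the test with both variables set, and `run_forward_test(())` with neither. Which combination selects which pipeline is determined by the branch itself, so the comments in the commands above remain the reference.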

Discussion

… next iteration in the non-whole-k-perfetch path
Copilot AI left a comment

Pull request overview

This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where work-group counts are low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles.

Changes:

  • Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to reduce shuffle instructions on MI350
  • Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate clear_tile calls
  • Adds full support for hdim96/hdim160 without padding requirements
  • Updates MFMA instruction selection to ensure optimal choices for MI350

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| block_gemm_areg_bsmem_trload_creg_v2_prefetch_n.hpp | New GEMM block implementation supporting transposed V-loading with N-dimension prefetching |
| block_gemm_areg_bsmem_creg_v2_prefetch_n.hpp | N-dimension prefetching GEMM implementation for standard (non-transposed) loading |
| block_gemm_areg_bsmem_creg_v2_prefetch_k.hpp | K-dimension prefetching GEMM implementation |
| tile_fmha_shape.hpp | Adds kN0Sub field and relaxes static assertion for N0-loop support |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_trload.hpp | New pipeline variant with transposed V-loading |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_default_policy.hpp | Comprehensive policy updates for LDS management, alignment, and MFMA selection |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp | Core pipeline updated with N0-loop implementation and simplified memory management |
| block_fmha_pipeline_problem.hpp | Adds utility functions for calculating optimal vector sizes |
| fmha_fwd_kernel.hpp | Kernel updates to support N0-loop pipelines and naive hdim loading |
| fmha.hpp | Includes new trload pipeline header |

@asleepzzz
Contributor

We found that async can beat whole_k with a new config; will discuss with qianfeng.

ammallya commented Feb 3, 2026

Error importing due to merge conflicts – please reopen the PR on ROCm/rocm-libraries

@asleepzzz asleepzzz enabled auto-merge (squash) March 17, 2026 03:23
@qianfengz qianfengz disabled auto-merge March 17, 2026 03:38
@qianfengz qianfengz enabled auto-merge (squash) March 17, 2026 03:38
@qianfengz qianfengz disabled auto-merge March 17, 2026 03:43