
vulkan : transpose A-matrix data layout for K-quant mul_mat performance #22970

Open

Alex-JP-93 wants to merge 1 commit into ggml-org:master from Alex-JP-93:vulkan-transpose-a-kquant

Conversation


@Alex-JP-93 Alex-JP-93 commented May 12, 2026

Overview

This PR optimizes Vulkan prompt processing performance for the K-quant formats Q4_K, Q5_K and Q6_K.

The core change repacks weight matrices during tensor upload. Previously, mul_mat accessed matrix A with non-contiguous, stride-based memory access, which severely hurt the L1 cache hit rate given K-quant block sizes (roughly 144–210 bytes depending on type). By remapping the layout from [row, k_block] to [k_block, row] inside set_tensor, shader reads for matrix A become sequential and cache-friendly.
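
To make the repacking concrete, here is a minimal sketch of the idea, using hypothetical names (repack_transa, block_bytes); the actual code lives in set_tensor in ggml-vulkan.cpp and handles the quant types and strides generically:

```cpp
// Minimal sketch of the repacking idea (hypothetical names, not the
// actual ggml-vulkan.cpp code): quant blocks are copied so that all rows
// of a given k-block become contiguous in the uploaded buffer.
#include <cstdint>
#include <cstring>

// Repack a quantized matrix from [row][k_block] to [k_block][row] order.
//   n_rows:      number of matrix rows (ne[1])
//   n_kblocks:   number of quant blocks per row
//   block_bytes: size of one quant block, e.g. 144 bytes for Q4_K
static void repack_transa(const uint8_t * src, uint8_t * dst,
                          int64_t n_rows, int64_t n_kblocks,
                          size_t block_bytes) {
    for (int64_t r = 0; r < n_rows; ++r) {
        for (int64_t kb = 0; kb < n_kblocks; ++kb) {
            const uint8_t * s = src + (size_t)(r  * n_kblocks + kb) * block_bytes; // row-major
            uint8_t       * d = dst + (size_t)(kb * n_rows    + r ) * block_bytes; // k-block-major
            std::memcpy(d, s, block_bytes);
        }
    }
}
```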

Key implementation details:

  • Matrix transpose runs only once, at tensor upload time. get_tensor reverses the layout transparently, so existing test frameworks see the original tensor layout unchanged.
  • The optimization applies only to non-batched weight tensors where ne[2] == ne[3] == 1, and excludes token embedding tensors. Batched matrix A retains its original layout.
  • Both the standard and _transa shader variants are compiled; runtime dispatch automatically selects the correct variant based on the tensor layout.
  • The entire optimization can be disabled by setting GGML_VK_NO_TRANSPOSE_A=1 (the eligibility check and opt-out are sketched after this list).
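
A rough illustration of the eligibility check and opt-out described above. The names transa_enabled, tensor_uses_transa and is_token_embedding are hypothetical, not the actual ggml-vulkan.cpp symbols, and the name-based embedding check is an assumption:

```cpp
// Sketch only: the real dispatch logic lives in ggml-vulkan.cpp.
#include "ggml.h"

#include <cstdlib>
#include <cstring>

static bool transa_enabled() {
    // Cached once; GGML_VK_NO_TRANSPOSE_A=1 disables the repacking globally.
    static const bool disabled = std::getenv("GGML_VK_NO_TRANSPOSE_A") != nullptr;
    return !disabled;
}

static bool is_token_embedding(const ggml_tensor * t) {
    // Illustrative assumption: identify the embedding tensor by name;
    // the real check may differ.
    return std::strcmp(t->name, "token_embd.weight") == 0;
}

static bool tensor_uses_transa(const ggml_tensor * t) {
    return transa_enabled()
        && t->ne[2] == 1 && t->ne[3] == 1 // non-batched weight tensors only
        && !is_token_embedding(t);
}
```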

Benchmarks collected on AMD Radeon AI PRO R9700 (RDNA4, gfx1201, wave64, KHR_coopmat):

| Model | Test | Before | After | Change |
|---|---|---|---|---|
| Qwen3.5-9B Q4_K_M | pp512 | 2554 t/s | 2758 t/s | +8.0% |
| Qwen3.5-9B Q4_K_M | tg128 | 92.08 t/s | 92.24 t/s | +0.2% |
| Qwen3-8B Q4_K_M | pp512 | 2936 t/s | 3106 t/s | +5.8% |
| Gemma-4 E4B Q4_K_M | pp512 | 4478 t/s | 4672 t/s | +4.3% |
| Gemma-4 E4B Q6_K | pp512 | 4022 t/s | 4475 t/s | +11.3% |
| MUL_MAT Q6_K (test-backend-ops) | m=4096 n=512 k=14336 | 38.21 TFLOPS | 44.03 TFLOPS | +15.2% |

Full prompt/generation sweep results are attached below. Generation throughput remains nearly flat (within ±1%) across all tests. Q6_K sees the largest prompt-side gains because its per-block dequantization has higher overhead, so eliminating irregular memory loads yields a larger relative speedup.

Additional information

Files changed

  • ggml-vulkan.cpp: Added layout repacking in set_tensor and inverse remapping in get_tensor; implemented the _transa pipeline definitions and runtime variant selection logic.
  • mul_mat_vec_q4_k/q5_k/q6_k.comp: Updated the indexing logic to respect the transposed layout via a runtime flag.
  • mul_mat_vecq.comp: Applied the same transpose-aware indexing to the integer dot product K-quant computation path.
  • mul_mat_vec_iface.glsl: Defined the MAT_VEC_FUSION_FLAGS_TRANSPOSE_A flag.
  • mul_mm.comp, mul_mm_funcs.glsl: Introduced the TRANSPOSE_A compile-time macro (the index arithmetic is sketched after this list).
  • vulkan-shaders-gen.cpp: Added code generation for the _transa and _transa_aligned shader variants.
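
The shader-side change boils down to a different linear block index for matrix A. Below is a C++ sketch of that arithmetic (illustrative only; the real code is GLSL guarded by the TRANSPOSE_A macro and the runtime flag):

```cpp
// With the transposed layout, the same k-block for consecutive rows is
// adjacent in memory, so a tile of rows loads contiguously instead of
// striding by the full row length.
#include <cstdint>

static inline int64_t a_block_index(int64_t row, int64_t kb,
                                    int64_t n_rows, int64_t n_kblocks,
                                    bool transpose_a) {
    return transpose_a ? kb  * n_rows    + row  // [k_block][row] layout
                       : row * n_kblocks + kb;  // [row][k_block] layout
}
```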

Performance

All tests run on AMD Radeon AI PRO R9700 (RDNA4, gfx1201).

Qwen3-8B Q4_K_M (prompt processing, fixed A/B config):

| Prompt size | Before (t/s) | After (t/s) | Change |
|---|---|---|---|
| pp128 | 2748 | 2908 | +5.8% |
| pp256 | 2872 | 3025 | +5.3% |
| pp512 | 2936 | 3106 | +5.8% |
| pp1024 | 2878 | 3033 | +5.4% |
| pp2048 | 2676 | 2804 | +4.8% |

Qwen3.5-9B Q4_K_M (prompt processing, fixed A/B config):

| Prompt size | Before (t/s) | After (t/s) | Change |
|---|---|---|---|
| pp128 | 2409 | 2611 | +8.4% |
| pp256 | 2492 | 2684 | +7.7% |
| pp512 | 2554 | 2758 | +8.0% |
| pp1024 | 2560 | 2772 | +8.3% |
| pp2048 | 2550 | 2757 | +8.1% |

Qwen3-8B Q4_K_M (token generation, fixed A/B config):

| TG size | Before (t/s) | After (t/s) | Change |
|---|---|---|---|
| tg32 | 104.40 | 105.35 | +0.9% |
| tg64 | 104.50 | 105.42 | +0.9% |
| tg128 | 104.54 | 105.35 | +0.8% |
| tg256 | 104.22 | 105.29 | +1.0% |
| tg512 | 103.08 | 104.31 | +1.2% |

Qwen3.5-9B Q4_K_M (token generation, fixed A/B config):

| TG size | Before (t/s) | After (t/s) | Change |
|---|---|---|---|
| tg32 | 91.41 | 91.77 | +0.4% |
| tg64 | 91.89 | 92.14 | +0.3% |
| tg128 | 92.08 | 92.24 | +0.2% |
| tg256 | 91.25 | 92.13 | +1.0% |
| tg512 | 91.67 | 91.94 | +0.3% |

Gemma-4 E4B (fixed A/B, pp512 / pp8192 / tg128):

| Quant | Test | Before (t/s) | After (t/s) | Change |
|---|---|---|---|---|
| Q4_K_M | pp512 | 4478.26 | 4671.56 | +4.3% |
| Q4_K_M | pp8192 | 4315.98 | 4523.42 | +4.8% |
| Q4_K_M | tg128 | 129.31 | 130.12 | +0.6% |
| Q6_K | pp512 | 4022.22 | 4474.85 | +11.3% |
| Q6_K | pp8192 | 3902.29 | 4333.61 | +11.1% |
| Q6_K | tg128 | 109.36 | 109.50 | +0.1% |

Q6_K delivers noticeably larger prompt throughput gains than Q4_K_M on the same model. This aligns with the standalone MUL_MAT operator results: Q6_K requires more work for per-block dequantization and sub-scale unpacking, so removing the scattered memory access yields a bigger performance uplift.

No token generation regressions were observed across the tested K-quant models. The minor 0.1–1% gains are attributed to improved cache locality in the underlying mul_mat_vec computation path.

Single operator MUL_MAT benchmark (test-backend-ops, m=4096, n=512, k=14336):

Type Before (TFLOPS) After (TFLOPS) Change
Q4_K 43.97 45.49 +3.5%
Q5_K 42.05 43.32 +3.0%
Q6_K 38.21 44.03 +15.2%

Extending to other quant types

The same optimization pattern can be ported to additional quantization formats as follows:

  1. Update the corresponding mul_mat_vec_*.comp shaders with transpose-aware memory indexing.
  2. Add _transa / _transa_aligned variant generation in vulkan-shaders-gen.cpp.
  3. Register the new _transa pipeline entries in ggml-vulkan.cpp.
  4. Append the new quant type to the four relevant lookup lists in ggml-vulkan.cpp covering set_tensor, get_tensor, mul_mat_q_f16 and mul_mat_vec_q_f16 (see the sketch after this list).
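
Step 4 could reduce to a single shared predicate. A minimal sketch, assuming a hypothetical helper name (type_supports_transa); the actual implementation keeps separate lookup lists per call site:

```cpp
// Hypothetical helper mirroring step 4: one predicate listing the
// repack-enabled quant types, consulted by set_tensor / get_tensor
// and the mul_mat dispatch paths.
#include "ggml.h"

static bool type_supports_transa(ggml_type t) {
    switch (t) {
        case GGML_TYPE_Q4_K:
        case GGML_TYPE_Q5_K:
        case GGML_TYPE_Q6_K:
            return true; // extend here when porting a new quant type
        default:
            return false;
    }
}
```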

Environment variable

  • GGML_VK_NO_TRANSPOSE_A=1: Fully disables the matrix A transpose optimization for all supported quant types. Useful for debugging logic regressions and for raw baseline performance comparisons.

Correctness validation

Ran test-backend-ops -b Vulkan0 -o MUL_MAT; all Q4_K/Q5_K/Q6_K and the existing quant types pass correctness checks. Setting GGML_VK_NO_TRANSPOSE_A=1 correctly reverts to the original baseline performance numbers.

Future work

I plan to implement a similar matrix B transpose optimization for the int8 MMQ computation path in a follow-up PR, once the required foundational changes are merged upstream.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. I used GitHub Copilot only for basic code autocompletion and for tidying raw benchmark data into readable tables. All architectural design, C++/shader implementation, benchmark configuration, performance analysis and final PR writing were done entirely by myself. I fully understand every part of this change and can answer any related technical questions in detail.

@Alex-JP-93 Alex-JP-93 requested a review from a team as a code owner May 12, 2026 10:30
@ggml-gh-bot

This comment was marked as resolved.

@Alex-JP-93 Alex-JP-93 closed this May 12, 2026
@Alex-JP-93 Alex-JP-93 reopened this May 12, 2026
@Alex-JP-93
Author

Hey, just saw the bot reminder. I've updated the PR description to be more straightforward. All the C++/GLSL code and benchmark data are my own work, so feel free to ask me any questions about the changes.

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels May 12, 2026
@TheBlueMatt
Contributor

Heh, that's now a third repacking form for vulkan with #22951 and #21024. Maybe we need to have a higher-level discussion somewhere about approaches and designs.

@Alex-JP-93
Author

> Heh, that's now a third repacking form for vulkan with #22951 and #21024. Maybe we need to have a higher-level discussion somewhere about approaches and designs.

Yeah, fair point — three different repacking shapes is starting to feel like a lot. I'm happy to join whatever discussion format works best (issue / discussion thread / RFC). If @0cc4m or other Vulkan maintainers want to set the direction first, I can adapt this PR's mechanism (or rebase on a shared one) once a common design lands.

@TheBlueMatt
Contributor

TheBlueMatt commented May 13, 2026

Yea, I have no idea what the process is (or should be?). Someone's gotta make a decision, but it's also true that we need to explore options first to see what options we have and what the performance of those different options is. Probably the first step is coalescing the options we have and trying to write up a unified description and performance results (or missing benchmarks) of the patches that exist in an issue somewhere.

@0cc4m
Contributor

0cc4m commented May 13, 2026

The process is simply that this is an ongoing discussion among maintainers and I won't merge any repacking PR until we come to a conclusion on what the right approach is. Gathering performance data is fine, but don't expect it to go anywhere within a short time frame.

@TheBlueMatt
Contributor

Sorry I didn't mean to imply there was no process, only that I don't know it :). It does seem potentially useful to have a tracking issue that breaks down the options we've come up with and also tracks branches of them so we can compare performance results across devices/drivers?

