vulkan : transpose A-matrix data layout for K-quant mul_mat performance #22970
Alex-JP-93 wants to merge 1 commit into
Conversation
Hey, just saw the bot reminder. I've updated the PR description to be more straightforward. All the C++/GLSL code and benchmark data are my own work, so feel free to ask me any questions about the changes.
Yeah, fair point — three different repacking shapes is starting to feel like a lot. I'm happy to join whatever discussion format works best (issue / discussion thread / RFC). If @0cc4m or other Vulkan maintainers want to set the direction first, I can adapt this PR's mechanism (or rebase on a shared one) once a common design lands.
Yeah, I have no idea what the process is (or should be?). Someone's gotta make a decision, but it's also true that we need to explore options first to see what we have and what the performance of those different options is. Probably the first step is coalescing the options we have and writing up a unified description and performance data (or missing benchmarks) for the existing patches in an issue somewhere.
The process is simply that this is an ongoing discussion among maintainers and I won't merge any repacking PR until we come to a conclusion on what the right approach is. Gathering performance data is fine, but don't expect it to go anywhere within a short time frame. |
Sorry, I didn't mean to imply there was no process, only that I don't know it :). It does seem potentially useful to have a tracking issue that breaks down the options we've come up with and also tracks their branches, so we can compare performance results across devices/drivers.
Overview
This PR optimizes Vulkan prompt processing performance specifically for the K-quant formats Q4_K, Q5_K, and Q6_K.
The core change repacks weight matrices during tensor upload. Previously, `mul_mat` accessed matrix A with non-contiguous, stride-based memory access, which severely hurt the L1 cache hit rate given K-quant block sizes (144–210 bytes depending on the quant type). By remapping the layout from `[row, k_block]` to `[k_block, row]` inside `set_tensor`, shader reads for matrix A become fully sequential and cache-friendly.

Key implementation details:

- `get_tensor` reverses the layout transparently, so existing test frameworks see the original tensor layout unchanged.
- Repacking applies only to 2D weight tensors (`ne[2] == ne[3] == 1`) and excludes token embedding tensors. Batched matrix A retains its original layout unchanged.
- Both regular and `_transa` shader variants are compiled; runtime dispatch automatically selects the correct variant based on tensor layout.
- The optimization can be disabled with `GGML_VK_NO_TRANSPOSE_A=1`.

Benchmarks were collected on an AMD Radeon AI PRO R9700 (RDNA4, gfx1201, wave64, KHR_coopmat):
Full prompt/generation sweep results are attached below. Generation throughput remains nearly flat within ±1% across all tests. Q6_K sees the largest prompt-side gains because its per-block dequantization has higher overhead; eliminating irregular memory loads provides more noticeable relative speedup.
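To make the repack described above concrete, here is a minimal sketch of the `[row, k_block]` → `[k_block, row]` permutation over raw quant blocks. All names and the `std::vector` interface are illustrative, not the PR's actual code; the real implementation lives in `set_tensor`/`get_tensor` inside `ggml-vulkan.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Rewrites a buffer of fixed-size quant blocks from [outer, inner] order to
// [inner, outer] order. For set_tensor, outer = rows and inner = k-blocks,
// so shader reads along a row become sequential after the repack.
static std::vector<uint8_t> repack_blocks(const uint8_t * src,
                                          size_t n_outer,     // e.g. rows
                                          size_t n_inner,     // e.g. k-blocks
                                          size_t block_bytes) {
    std::vector<uint8_t> dst(n_outer * n_inner * block_bytes);
    for (size_t o = 0; o < n_outer; ++o) {
        for (size_t i = 0; i < n_inner; ++i) {
            // source block (o, i) lands at transposed position (i, o)
            std::memcpy(dst.data() + (i * n_outer + o) * block_bytes,
                        src        + (o * n_inner + i) * block_bytes,
                        block_bytes);
        }
    }
    return dst;
}
```

Because the permutation inverts itself once the dimensions are swapped, calling the same routine again with `n_outer`/`n_inner` exchanged recovers the original buffer — which is how `get_tensor` can present the untransposed layout to callers.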
Additional information
Files changed
- `ggml-vulkan.cpp`: Added layout repacking in `set_tensor`, inverse remapping in `get_tensor`, and `_transa` pipeline definitions plus runtime variant selection logic.
- `mul_mat_vec_q4_k/q5_k/q6_k.comp`: Updated indexing logic to respect the transposed layout via a runtime flag.
- `mul_mat_vecq.comp`: Applied the same transpose-aware indexing to the integer dot product K-quant computation path.
- `mul_mat_vec_iface.glsl`: Defined the `MAT_VEC_FUSION_FLAGS_TRANSPOSE_A` flag.
- `mul_mm.comp`, `mul_mm_funcs.glsl`: Introduced the `TRANSPOSE_A` compile-time macro.
- `vulkan-shaders-gen.cpp`: Added code generation for the `_transa` and `_transa_aligned` shader variants.

Performance
All tests run on AMD Radeon AI PRO R9700 (RDNA4, gfx1201).
Qwen3-8B Q4_K_M (prompt processing, fixed A/B config):
Qwen3.5-9B Q4_K_M (prompt processing, fixed A/B config):
Qwen3-8B Q4_K_M (token generation, fixed A/B config):
Qwen3.5-9B Q4_K_M (token generation, fixed A/B config):
Gemma-4 E4B (fixed A/B, pp512 / pp8192 / tg128):
Q6_K delivers noticeably larger prompt throughput gains compared to Q4_K_M on the same model. This aligns with standalone MUL_MAT operator results: Q6_K requires more work for per-block dequantization and sub-scale unpacking, so removing inefficient scattered memory access yields a bigger performance uplift.
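The locality argument can be illustrated with the address strides involved. In the original `[row, k_block]` layout, threads working on the same k-block of adjacent rows read addresses a full row apart; after the repack, those reads are back-to-back. A sketch with illustrative names (144 bytes matches the Q4_K block footprint; k = 14336 gives 56 blocks of 256 weights):

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of block (row, kb) in the original [row, k_block] layout.
constexpr size_t off_row_major(size_t row, size_t kb, size_t n_kb, size_t bs) {
    return (row * n_kb + kb) * bs;
}

// Byte offset of block (row, kb) in the transposed [k_block, row] layout:
// adjacent rows of the same k-block are now only one block apart.
constexpr size_t off_transposed(size_t row, size_t kb, size_t n_rows, size_t bs) {
    return (kb * n_rows + row) * bs;
}
```

With 56 k-blocks of 144 bytes, moving from row 0 to row 1 at the same k-block jumps 8064 bytes in the original layout but only 144 bytes in the transposed one, which is why the transposed reads stay within far fewer cache lines.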
No token generation performance regressions were observed across all tested K-quant models. Minor 0.1–1% gains are attributed to improved cache locality in the underlying `mul_mat_vec` computation path.

Single operator MUL_MAT benchmark (test-backend-ops, m=4096, n=512, k=14336):
Extending to other quant types
The same optimization pattern can be ported to additional quantization formats following these steps:
- Update the relevant `mul_mat_vec_*.comp` shaders with transpose-aware memory indexing.
- Add `_transa`/`_transa_aligned` variant generation in `vulkan-shaders-gen.cpp`.
- Register `_transa` pipeline entries inside `ggml-vulkan.cpp`.
- Extend the host-side logic in `ggml-vulkan.cpp` covering `set_tensor`, `get_tensor`, `mul_mat_q_f16` and `mul_mat_vec_q_f16`.

Environment variable
`GGML_VK_NO_TRANSPOSE_A=1`: Fully disables the matrix A transpose optimization for all supported quant types. Useful for debugging logic regressions and for raw baseline performance comparison.

Correctness validation
Ran `test-backend-ops -b Vulkan0 -o MUL_MAT`; all Q4_K/Q5_K/Q6_K and existing quant types pass correctness checks. Setting `GGML_VK_NO_TRANSPOSE_A=1` correctly reverts to the original baseline performance numbers.

Future work
I plan to implement a similar matrix B transpose optimization for the int8 MMQ computation path in a follow-up PR, once the required foundational changes are merged upstream.