vulkan : transpose A-matrix data layout for K-quant mul_mat performance #22970
Alex-JP-93 wants to merge 1 commit into
Conversation
Hey, just saw the bot reminder. I've updated the PR description to be more straightforward. All the C++/GLSL code and benchmark data are my own work, so feel free to ask me any questions about the changes.
Yeah, fair point — three different repacking shapes is starting to feel like a lot. I'm happy to join whatever discussion format works best (issue / discussion thread / RFC). If @0cc4m or other Vulkan maintainers want to set the direction first, I can adapt this PR's mechanism (or rebase on a shared one) once a common design lands.
Yeah, I have no idea what the process is (or should be?). Someone's gotta make a decision, but it's also true that we need to explore options first to see what we have and what the performance of those different options is. Probably the first step is coalescing the options we have and writing up a unified description and performance data (or missing benchmarks) for the existing patches in an issue somewhere.
The process is simply that this is an ongoing discussion among maintainers and I won't merge any repacking PR until we come to a conclusion on what the right approach is. Gathering performance data is fine, but don't expect it to go anywhere within a short time frame. |
Sorry, I didn't mean to imply there was no process, only that I don't know it :). It does seem potentially useful to have a tracking issue that breaks down the options we've come up with and also tracks their branches, so we can compare performance results across devices/drivers.
Overview
This PR optimizes Vulkan prompt processing performance specifically for the K-quant formats Q4_K, Q5_K, and Q6_K.
The core change repacks weight matrices during tensor upload. Previously, `mul_mat` accessed matrix A with non-contiguous, stride-based memory access, which severely hurt the L1 cache hit rate given K-quant block sizes (144–210 bytes depending on the quant type). By remapping the layout from `[row, k_block]` to `[k_block, row]` inside `set_tensor`, shader reads for matrix A become fully sequential and cache-friendly.

Key implementation details:

- `get_tensor` reverses the layout transparently, so existing test frameworks see the original tensor layout unchanged.
- Repacking applies only to 2D weight tensors (`ne[2] == ne[3] == 1`) and excludes token embedding tensors. Batched matrix A retains its original layout unchanged.
- Both regular and `_transa` shader variants are compiled; runtime dispatch automatically selects the correct variant based on tensor layout.
- The optimization can be disabled with `GGML_VK_NO_TRANSPOSE_A=1`.

Benchmarks were collected on an AMD Radeon AI PRO R9700 (RDNA4, gfx1201, wave64, KHR_coopmat):
Full prompt/generation sweep results are attached below. Generation throughput remains nearly flat within ±1% across all tests. Q6_K sees the largest prompt-side gains because its per-block dequantization has higher overhead; eliminating irregular memory loads provides more noticeable relative speedup.
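To make the repack described above concrete, here is a minimal sketch of the `[row, k_block]` → `[k_block, row]` permutation over raw quant blocks. All names and the `std::vector` interface are illustrative, not the PR's actual code; the real implementation lives in `set_tensor`/`get_tensor` inside `ggml-vulkan.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Rewrites a buffer of fixed-size quant blocks from [outer, inner] order to
// [inner, outer] order. For set_tensor, outer = rows and inner = k-blocks,
// so shader reads along a row become sequential after the repack.
static std::vector<uint8_t> repack_blocks(const uint8_t * src,
                                          size_t n_outer,     // e.g. rows
                                          size_t n_inner,     // e.g. k-blocks
                                          size_t block_bytes) {
    std::vector<uint8_t> dst(n_outer * n_inner * block_bytes);
    for (size_t o = 0; o < n_outer; ++o) {
        for (size_t i = 0; i < n_inner; ++i) {
            // source block (o, i) lands at transposed position (i, o)
            std::memcpy(dst.data() + (i * n_outer + o) * block_bytes,
                        src        + (o * n_inner + i) * block_bytes,
                        block_bytes);
        }
    }
    return dst;
}
```

Because the permutation inverts itself once the dimensions are swapped, calling the same routine again with `n_outer`/`n_inner` exchanged recovers the original buffer — which is how `get_tensor` can present the untransposed layout to callers.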
Additional information
Files changed
- `ggml-vulkan.cpp`: Added layout repacking in `set_tensor`, inverse remapping in `get_tensor`, and `_transa` pipeline definitions plus runtime variant selection logic.
- `mul_mat_vec_q4_k/q5_k/q6_k.comp`: Updated indexing logic to respect the transposed layout via a runtime flag.
- `mul_mat_vecq.comp`: Applied the same transpose-aware indexing to the integer dot product K-quant computation path.
- `mul_mat_vec_iface.glsl`: Defined the `MAT_VEC_FUSION_FLAGS_TRANSPOSE_A` flag.
- `mul_mm.comp`, `mul_mm_funcs.glsl`: Introduced the `TRANSPOSE_A` compile-time macro.
- `vulkan-shaders-gen.cpp`: Added code generation for the `_transa` and `_transa_aligned` shader variants.

Performance
All tests run on AMD Radeon AI PRO R9700 (RDNA4, gfx1201).
Qwen3-8B Q4_K_M (prompt processing, fixed A/B config):
Qwen3.5-9B Q4_K_M (prompt processing, fixed A/B config):
Qwen3-8B Q4_K_M (token generation, fixed A/B config):
Qwen3.5-9B Q4_K_M (token generation, fixed A/B config):
Gemma-4 E4B (fixed A/B, pp512 / pp8192 / tg128):
Q6_K delivers noticeably larger prompt throughput gains compared to Q4_K_M on the same model. This aligns with standalone MUL_MAT operator results: Q6_K requires more work for per-block dequantization and sub-scale unpacking, so removing inefficient scattered memory access yields a bigger performance uplift.
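The locality argument can be illustrated with the address strides involved. In the original `[row, k_block]` layout, threads working on the same k-block of adjacent rows read addresses a full row apart; after the repack, those reads are back-to-back. A sketch with illustrative names (144 bytes matches the Q4_K block footprint; k = 14336 gives 56 blocks of 256 weights):

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of block (row, kb) in the original [row, k_block] layout.
constexpr size_t off_row_major(size_t row, size_t kb, size_t n_kb, size_t bs) {
    return (row * n_kb + kb) * bs;
}

// Byte offset of block (row, kb) in the transposed [k_block, row] layout:
// adjacent rows of the same k-block are now only one block apart.
constexpr size_t off_transposed(size_t row, size_t kb, size_t n_rows, size_t bs) {
    return (kb * n_rows + row) * bs;
}
```

With 56 k-blocks of 144 bytes, moving from row 0 to row 1 at the same k-block jumps 8064 bytes in the original layout but only 144 bytes in the transposed one, which is why the transposed reads stay within far fewer cache lines.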
No token generation performance regressions were observed across all tested K-quant models. Minor 0.1–1% gains are attributed to improved cache locality in the underlying `mul_mat_vec` computation path.

Single operator MUL_MAT benchmark (test-backend-ops, m=4096, n=512, k=14336):
Extending to other quant types
The same optimization pattern can be ported to additional quantization formats following these steps:
- Update the relevant `mul_mat_vec_*.comp` shaders with transpose-aware memory indexing.
- Add `_transa`/`_transa_aligned` variant generation in `vulkan-shaders-gen.cpp`.
- Register `_transa` pipeline entries inside `ggml-vulkan.cpp`.
- Extend the host-side logic in `ggml-vulkan.cpp` covering `set_tensor`, `get_tensor`, `mul_mat_q_f16` and `mul_mat_vec_q_f16`.

Environment variable
`GGML_VK_NO_TRANSPOSE_A=1`: Fully disables the matrix A transpose optimization for all supported quant types. Useful for debugging logic regressions and for raw baseline performance comparison.

Correctness validation
Ran `test-backend-ops -b Vulkan0 -o MUL_MAT`; all Q4_K/Q5_K/Q6_K and existing quant types pass correctness checks. Setting `GGML_VK_NO_TRANSPOSE_A=1` correctly reverts to the original baseline performance numbers.

Future work
I plan to implement a similar matrix B transpose optimization for the int8 MMQ computation path in a follow-up PR, once the required foundational changes are merged upstream.