vulkan: opt mul_mat_vecq for mi50 #22933

Open
chraac wants to merge 23 commits into ggml-org:master from chraac:dev-mi50

Conversation

@chraac
Contributor

@chraac chraac commented May 11, 2026

Overview

1. Enable subgroup ops on supported AMD GCN 5.0/5.1 devices

In ggml-vulkan.cpp, this adds a subgroups_gcn_enabled device flag and enables subgroup arithmetic for a small allowlisted set of AMD GPUs based on device name matching.

Previously, AMD GCN devices were excluded from this subgroup path entirely. With this change, supported GCN 5.x devices can use subgroup arithmetic in ggml_vk_load_shaders.
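
As a sketch of that gating (the helper name below is illustrative; the actual flag in ggml-vulkan.cpp is `subgroups_gcn_enabled`, and the regex is the one quoted later in this thread):

```cpp
#include <regex>
#include <string>

// Hypothetical sketch of the allowlist gate: enable the subgroup
// arithmetic path only for AMD GCN 5.0/5.1 devices whose reported
// name matches a known-good set. Only the regex itself comes from
// the PR diff; the function name is illustrative.
static bool gcn_subgroups_allowed(const std::string & device_name) {
    static const std::regex gcn_regex(
        "^.*(Radeon.*(VII|Vega)|Instinct.*MI(25|50|60)).*$");
    return std::regex_match(device_name, gcn_regex);
}
```

With this, a reported name like "AMD Instinct MI50" matches, while "AMD Radeon RX 6700 XT" does not.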

2. Optimize mul_mat_vecq.comp for Q4_0

In the Vulkan shader:

  • add GL_EXT_shader_explicit_arithmetic_types_float16
  • use half-precision (f16vec2) for the cached ds path when DATA_A_Q4_0 is active
  • reshape the main iteration logic to reduce repeated index work and make the loop structure friendlier to partial unrolling
  • keep the non-Q4_0 path unchanged
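
Taken together, the Q4_0 branch looks roughly like the sketch below (the macro names match the diff hunk quoted later in this thread; this is an outline, not the full shader):

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// For Q4_0, the cached (d, s) scale pair can live in half precision;
// every other quant type keeps the original full-precision vec2.
#if defined(DATA_A_Q4_0)
#define CACHE_VEC_TYPE f16vec2
#else
#define CACHE_VEC_TYPE vec2
#endif

CACHE_VEC_TYPE cache_b_ds;
```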

Performance

  • Device: MI50 32g
  • Baseline: f3c3e0e
  • Optimization: 854c643
  • Command: ./llama-bench --progress -mmp 0 -r 40 -p 512 -n 128 -m <path_to_gguf>
| Model (tg128) | Baseline (tk/s) | Optimization (tk/s) | Speedup |
| --- | --- | --- | --- |
| qwen35 9B Q4_0 | 52.24 | 62.13 | 1.18x |
| qwen35moe 35B.A3B Q4_0 | 61.38 | 71.30 | 1.16x |

On newer GPUs such as RX 6700, results appear roughly neutral.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, using an AI agent for commit log writing and code review

chraac and others added 18 commits May 2, 2026 23:45
…nrolling in iter function"

This reverts commit e343296.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@chraac chraac requested a review from a team as a code owner May 11, 2026 06:08
@github-actions github-actions bot added labels Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) May 11, 2026
vec2 cache_b_ds;

#if defined(DATA_A_Q4_0)
#define CACHE_VEC_TYPE f16vec2
Contributor

This would require generating a separate set of shaders for devices that do/don't support float16. Is this really helping performance?

Contributor Author

Yes, it provides a ~10% performance improvement on test-backend-ops perf, so I thought it was worth adding a separate shader config for this f16 internal path. Do we have any existing example PRs I could look at?

Contributor Author

Added a fix here to separate the fp16 sources; now we only load the fp16 shader when device->f16 is true. Could you please have another look? @jeffbolznv

Contributor

@0cc4m do you think we could require fp16 support for the mmqv path, to keep this simple?

Contributor

I'd be hesitant to do that because Nvidia Pascal relies on DP4A for performance and has no FP16 support. And this is too much complexity just to replace a single vec2.

}

// Enable subgroup operations on AMD GCN 5.0/5.1 GPUs
static const std::regex s_gcn_regex("^.*(Radeon.*(VII|Vega)|Instinct.*MI(25|50|60)).*$");
Contributor

Should this check really be looking at specific devices, or is it that some drivers are good at it and some aren't?

Contributor Author

Good point.
While we already check the hardware extension flag, the device name check was added as a safeguard to prevent untested devices from hitting this path.
But I think you're right that gating by driver version makes much more sense than hardcoding names. I've confirmed it works on MI50 with pro 26.Q1, but I haven't tested older GCN hardware. Would you prefer I update this check to target the driver version instead?

Contributor

@0cc4m knows better than I which AMD drivers have had issues with subgroup ops. But I assume this will end up being a driver and/or arch check rather than a name check.

Contributor

Disabling subgroup ops was purely a performance decision; a regex based on the name makes no sense. My tests were on Linux, on a Radeon Pro VII (same chip). There is no driver support for these cards on Windows; if you are running there, you are using an outdated driver with known issues. A Linux test would be more valid.

while (i < unrolled_iters) {
const uint b_qs_idx = tid % (32 / K_PER_ITER);
uint col = tid * K_PER_ITER;
while (num_iters >= 4) {
Contributor

Why are these changes to the outer loop structure necessary? The inner loop should be easily unrolled regardless.

Contributor Author

Good question. My profiling showed that moving b_qs_idx to the outer loop and merging the loop condition variables into the col calculation reduces VGPR usage from 34 to 33. This drop in register pressure increases occupancy, yielding a ~5% performance improvement on test-backend-ops perf on my device (MI50).
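
A rough outline of that restructuring (the unroll body and stride arithmetic are elided; only the hoisted index math corresponds to the quoted diff):

```glsl
// Hoist the per-thread quant index out of the loop: it does not
// depend on the iteration, so computing it once keeps it out of the
// per-iteration live range.
const uint b_qs_idx = tid % (32 / K_PER_ITER);
// Fold the loop counter into the running column offset instead of
// keeping a separate induction variable, trimming one live VGPR.
uint col = tid * K_PER_ITER;
while (num_iters >= 4) {
    // ... 4x partially unrolled dot-product body using col and b_qs_idx ...
    num_iters -= 4;
}
```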

@chraac chraac requested a review from jeffbolznv May 13, 2026 10:19
