vulkan: opt mul_mat_vecq for mi50 #22933

Open
chraac wants to merge 23 commits into ggml-org:master from chraac:dev-mi50

Conversation

@chraac
Contributor

@chraac chraac commented May 11, 2026

Overview

1. Enable subgroup ops on supported AMD GCN 5.0/5.1 devices

In ggml-vulkan.cpp, this adds a subgroups_gcn_enabled device flag and enables subgroup arithmetic for a small allowlisted set of AMD GPUs based on device name matching.

Previously, AMD GCN devices were excluded from this subgroup path entirely. With this change, supported GCN 5.x devices can use subgroup arithmetic in ggml_vk_load_shaders.
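
As a sketch of that gating (the helper name below is illustrative; the actual flag in ggml-vulkan.cpp is `subgroups_gcn_enabled`, and the regex is the one quoted later in this thread):

```cpp
#include <regex>
#include <string>

// Hypothetical sketch of the allowlist gate: enable the subgroup
// arithmetic path only for AMD GCN 5.0/5.1 devices whose reported
// name matches a known-good set. Only the regex itself comes from
// the PR diff; the function name is illustrative.
static bool gcn_subgroups_allowed(const std::string & device_name) {
    static const std::regex gcn_regex(
        "^.*(Radeon.*(VII|Vega)|Instinct.*MI(25|50|60)).*$");
    return std::regex_match(device_name, gcn_regex);
}
```

With this, a reported name like "AMD Instinct MI50" matches, while "AMD Radeon RX 6700 XT" does not.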

2. Optimize mul_mat_vecq.comp for Q4_0

In the Vulkan shader:

  • add GL_EXT_shader_explicit_arithmetic_types_float16
  • use half-precision (f16vec2) for the cached ds path when DATA_A_Q4_0 is active
  • reshape the main iteration logic to reduce repeated index work and make the loop structure friendlier to partial unrolling
  • keep the non-Q4_0 path unchanged
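
Taken together, the Q4_0 branch looks roughly like the sketch below (the macro names match the diff hunk quoted later in this thread; this is an outline, not the full shader):

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// For Q4_0, the cached (d, s) scale pair can live in half precision;
// every other quant type keeps the original full-precision vec2.
#if defined(DATA_A_Q4_0)
#define CACHE_VEC_TYPE f16vec2
#else
#define CACHE_VEC_TYPE vec2
#endif

CACHE_VEC_TYPE cache_b_ds;
```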

Performance

  • Device: MI50 32g
  • Baseline: f3c3e0e
  • Optimization: 854c643
  • Command: ./llama-bench --progress -mmp 0 -r 40 -p 512 -n 128 -m <path_to_gguf>
| Model (tg128) | Baseline (tk/s) | Optimization (tk/s) | Speedup |
| --- | --- | --- | --- |
| qwen35 9B Q4_0 | 52.24 | 62.13 | 1.18x |
| qwen35moe 35B.A3B Q4_0 | 61.38 | 71.30 | 1.16x |

On newer GPUs such as RX 6700, results appear roughly neutral.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, using an AI agent for commit log writing and code review

chraac and others added 18 commits May 2, 2026 23:45
…nrolling in iter function"

This reverts commit e343296.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@chraac chraac requested a review from a team as a code owner May 11, 2026 06:08
@github-actions github-actions bot added labels Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) May 11, 2026
vec2 cache_b_ds;

#if defined(DATA_A_Q4_0)
#define CACHE_VEC_TYPE f16vec2
Contributor

This would require generating a separate set of shaders for devices that do/don't support float16. Is this really helping performance?

Contributor Author

Yes, it provides a ~10% performance improvement on test-backend-ops perf, so I thought it was worth adding a separate shader config for this f16 internal path. Do we have any existing example PRs I could look at?

Contributor Author

Added a fix here to separate the fp16 sources; now we only load the fp16 shader when device->f16 is true. Could you please have another look? @jeffbolznv

Contributor

@0cc4m do you think we could require fp16 support for the mmqv path, to keep this simple?

Contributor

I'd be hesitant to do that because Nvidia Pascal relies on DP4A for performance and has no FP16 support. And this is too much complexity just to replace a single vec2.

}

// Enable subgroup operations on AMD GCN 5.0/5.1 GPUs
static const std::regex s_gcn_regex("^.*(Radeon.*(VII|Vega)|Instinct.*MI(25|50|60)).*$");
Contributor

Should this check really be looking at specific devices, or is it that some drivers are good at it and some aren't?

Contributor Author

Good point.
While we already check the hardware extension flag, the device name check was added as a safeguard to prevent untested devices from hitting this path.
But I think you're right that gating by driver version makes much more sense than hardcoding names. I've confirmed it works on MI50 with pro 26.Q1, but I haven't tested older GCN hardware. Would you prefer I update this check to target the driver version instead?

Contributor

@0cc4m knows better than I which AMD drivers have had issues with subgroup ops. But I assume this will end up being a driver and/or arch check rather than a name check.

Contributor

Disabling subgroup ops was purely a performance decision; a regex based on the name makes no sense. My tests were on Linux, on a Radeon Pro VII (same chip). There is no driver support for these cards on Windows; if you are running there, you are using an outdated driver with known issues. A Linux test would be more valid.

while (i < unrolled_iters) {
const uint b_qs_idx = tid % (32 / K_PER_ITER);
uint col = tid * K_PER_ITER;
while (num_iters >= 4) {
Contributor

Why are these changes to the outer loop structure necessary? The inner loop should be easily unrolled regardless.

Contributor Author

Good question. My profiling showed that moving b_qs_idx to the outer loop and merging the loop condition variables into the col calculation reduces VGPR usage from 34 to 33. This drop in register pressure increases occupancy, yielding a ~5% performance improvement on test-backend-ops perf on my device (MI50).
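
A rough outline of that restructuring (the unroll body and stride arithmetic are elided; only the hoisted index math corresponds to the quoted diff):

```glsl
// Hoist the per-thread quant index out of the loop: it does not
// depend on the iteration, so computing it once keeps it out of the
// per-iteration live range.
const uint b_qs_idx = tid % (32 / K_PER_ITER);
// Fold the loop counter into the running column offset instead of
// keeping a separate induction variable, trimming one live VGPR.
uint col = tid * K_PER_ITER;
while (num_iters >= 4) {
    // ... 4x partially unrolled dot-product body using col and b_qs_idx ...
    num_iters -= 4;
}
```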

@chraac chraac requested a review from jeffbolznv May 13, 2026 10:19
