vulkan: opt mul_mat_vecq for mi50#22933
Conversation
…atrix multiplication
…ng in mul_mat_vecq shader
…larity and performance
… in iter function
…nrolling in iter function" This reverts commit e343296.
…date mmvq_dot_product accordingly
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| vec2 cache_b_ds; | ||
|
|
||
| #if defined(DATA_A_Q4_0) | ||
| #define CACHE_VEC_TYPE f16vec2 |
There was a problem hiding this comment.
This would require generating a separate set of shaders for devices that do/don't support float16. Is this really helping performance?
There was a problem hiding this comment.
Yes, it provides a ~10% performance improvement on test-backend-ops perf, thought it was worth adding a separate shader config for this f16 internal path. Do we have any existing example PRs I could look at?
There was a problem hiding this comment.
Add a fix here to separated the fp16 sources, now we'll only load the fp16 shader when device->f16 is true, could you please have a another look? @jeffbolznv
There was a problem hiding this comment.
@0cc4m do you think we could require fp16 support for the mmqv path, to keep this simple?
There was a problem hiding this comment.
I'd be hesitant to do that because Nvidia Pascal relies on DP4A for performance and has no FP16 support. And this is too much complexity just to replace a single vec2.
| } | ||
|
|
||
| // Enable subgroup operations on AMD GCN 5.0/5.1 GPUs | ||
| static const std::regex s_gcn_regex("^.*(Radeon.*(VII|Vega)|Instinct.*MI(25|50|60)).*$"); |
There was a problem hiding this comment.
Should this check really be looking at specific devices, or is it that some drivers are good at it and some aren't?
There was a problem hiding this comment.
Good point.
While we already check the hardware extension flag, the device name check was added as a safeguard to prevent untested devices from hitting this path.
But thought you're right that gating by driver version makes much more sense than hardcoding names. I've confirmed it works on MI50 with pro 26.Q1, but haven't tested older GCN hardware. Would you prefer I update this check to target the driver version instead?
There was a problem hiding this comment.
@0cc4m knows better than I which AMD drivers have had issues with subgroup ops. But I assume this will end up being a driver and/or arch check rather than a name check.
There was a problem hiding this comment.
Disabling subgroup ops was purely a performance decision, a regex based on the name makes no sense. My tests were on Linux, on Radeon Pro VII (same chip). There is no driver support for these cards on Windows, if you are running there you are using an outdated driver with known issues. A Linux test would be more valid.
| while (i < unrolled_iters) { | ||
| const uint b_qs_idx = tid % (32 / K_PER_ITER); | ||
| uint col = tid * K_PER_ITER; | ||
| while (num_iters >= 4) { |
There was a problem hiding this comment.
Why are these changes to the outer loop structure necessary? The inner loop should be easily unrolled regardless.
There was a problem hiding this comment.
Good question. My profiling showed that moving b_qs_idx to the outer loop and merging the loop condition variables into the col calculation reduces VGPR usage from 34 to 33. This drop in register pressure increases occupancy, yielding a ~5% performance improvement on test-backend-ops perf on my device (MI50).
Overview
1. Enable subgroup ops on supported AMD GCN 5.0/5.1 devices
In
ggml-vulkan.cpp, this adds asubgroups_gcn_enableddevice flag and enables subgroup arithmetic for a small allowlisted set of AMD GPUs based on device name matching.Previously, AMD GCN devices were excluded from this subgroup path entirely. With this change, supported GCN 5.x devices can use subgroup arithmetic in
ggml_vk_load_shaders.2. Optimize
mul_mat_vecq.compfor Q4_0In the Vulkan shader:
GL_EXT_shader_explicit_arithmetic_types_float16f16vec2) for the cacheddspath whenDATA_A_Q4_0is activeQ4_0path unchangedPerformance
MI50 32gf3c3e0e854c643./llama-bench --progress -mmp 0 -r 40 -p 512 -n 128 -m <path_to_gguf>On newer GPUs such as RX 6700, results appear roughly neutral.
Requirements
YES, Using AI agent for commit log writing, code reviewing