
vulkan: avoid preferring transfer queue on AMD UMA devices #22455

Open
winstonma wants to merge 1 commit into ggml-org:master from winstonma:winston/vulkan-uma-transfer-queue

Conversation

@winstonma
Contributor

@winstonma winstonma commented Apr 28, 2026

Overview

On discrete GPUs (dGPUs), a dedicated transfer queue is beneficial because memory is separate from the CPU, so offloading transfers improves throughput. On UMA devices, CPU and GPU share memory, so the extra queue synchronization adds overhead without benefit.
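The decision described above can be sketched as a small heuristic. This is a hypothetical Python illustration of the idea only, not the actual ggml-vulkan C++ code; the function name and the `is_uma`/`vendor` parameters are assumptions, while the `GGML_VK_ASYNC_USE_TRANSFER_QUEUE` override is the one used in the benchmark below:

```python
import os

def use_transfer_queue(is_uma: bool, vendor: str) -> bool:
    """Decide whether to prefer a dedicated transfer queue.

    Hypothetical sketch of the PR's heuristic: on AMD UMA devices the
    CPU and GPU share memory, so the extra cross-queue synchronization
    is pure overhead and the transfer queue is skipped. An environment
    variable forces the legacy always-on behavior for benchmarking.
    """
    # Explicit override restores the old behavior (used as the
    # "forced legacy" mode in the benchmark).
    if os.environ.get("GGML_VK_ASYNC_USE_TRANSFER_QUEUE") == "1":
        return True
    # On AMD UMA devices, skip the transfer queue (this PR's change).
    if is_uma and vendor == "AMD":
        return False
    # Discrete GPUs keep the dedicated transfer queue.
    return True

print(use_transfer_queue(is_uma=True, vendor="AMD"))   # AMD APU: False
print(use_transfer_queue(is_uma=False, vendor="AMD"))  # AMD dGPU: True
```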

Additional information

Attached is the benchmark result from my device. The benchmark measures the performance impact of the transfer-queue UMA patch by comparing the two queue scheduling behaviors under isolated, repeatable conditions.

❯ ./bench-compare-queue.sh --no-build
=== Queue Synchronization Benchmark (Transfer-Queue UMA Patch) ===

This benchmark measures the overhead of Vulkan queue submission and
synchronization, which is what the transfer-queue UMA patch optimizes.
Comparison mode:
  1) Patched default behavior (auto heuristic on UMA)
  2) Forced legacy behavior (GGML_VK_ASYNC_USE_TRANSFER_QUEUE=1)
Trials per mode: 5 (alternating order)

Skipping build step (using existing build)

Running benchmarks...

Trial 1/5
  Running patched default...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon 890M Graphics (RADV STRIX1))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen AI 7 PRO 360 w/ Radeon 880M)
load_backend: failed to find ggml_backend_init in ~/Code/llama.cpp/build-bench-main/bin/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in ~/Code/llama.cpp/build-bench-main/bin/libggml-cpu.so
    Mean: 3.05 ms
  Running forced legacy...
    Mean: 3.47 ms

Trial 2/5
  Running forced legacy...
    Mean: 3.45 ms
  Running patched default...
    Mean: 3.06 ms

Trial 3/5
  Running patched default...
    Mean: 3.22 ms
  Running forced legacy...
    Mean: 3.23 ms

Trial 4/5
  Running forced legacy...
    Mean: 3.37 ms
  Running patched default...
    Mean: 3.31 ms

Trial 5/5
  Running patched default...
    Mean: 3.18 ms
  Running forced legacy...
    Mean: 3.29 ms


=== Summary ===
Patched default means:            3.05 3.06 3.22 3.31 3.18
Forced legacy means:              3.47 3.45 3.23 3.37 3.29
Patched default median:                3.18 ms
Forced legacy median:                  3.37 ms
Median delta vs forced legacy: -5.64% (patched default is faster)

Note: Lower Mean is better. This isolates the transfer-queue decision without requiring an older worktree target.
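The summary figures can be reproduced from the per-trial means printed above (plain Python, values copied from the log):

```python
from statistics import median

patched = [3.05, 3.06, 3.22, 3.31, 3.18]  # patched default means (ms)
legacy  = [3.47, 3.45, 3.23, 3.37, 3.29]  # forced legacy means (ms)

p_med, l_med = median(patched), median(legacy)
delta = (p_med - l_med) / l_med * 100  # negative: patched is faster

print(f"Patched default median: {p_med:.2f} ms")  # 3.18 ms
print(f"Forced legacy median:   {l_med:.2f} ms")  # 3.37 ms
print(f"Median delta: {delta:.2f}%")              # -5.64%
```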

Requirements

@winstonma winstonma requested a review from a team as a code owner April 28, 2026 02:32
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Apr 28, 2026
@0cc4m
Contributor

0cc4m commented May 12, 2026

Neither your reasoning nor your AI-generated test provide any way to confirm or deny your claims. What are you trying to fix/improve? Where can it be measured in actual use?

@winstonma
Contributor Author

winstonma commented May 12, 2026

Actually I tested this together with another PR that optimizes reads/writes. The combined PRs improve prompt processing by 40% on my AMD UMA platform (also verified by another tester).

I couldn't see any overall performance improvement from either this PR or #22462 alone; only combining the two shows a prompt-processing speedup.

Conversely, if #22462 is merged first, the performance case for this PR becomes easier to demonstrate.

Disabling the transfer queue should benefit all UMA systems, but I only have an AMD device to test, so I limited the change to AMD.

For the reasoning, please refer to GPU Memory Pools in D3D12, which discusses how using the transfer queue can hurt performance on UMA systems.

A UMA/integrated device often doesn’t have a dedicated DMA engine, and so any submissions to COPY (TRANSFER) queues will end up getting serialized/flattened onto a single hardware queue.

I made this change based on that reasoning and submitted this PR because the results on my device agree with his blog post.

@0cc4m
Contributor

0cc4m commented May 12, 2026

That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.

@winstonma
Contributor Author

winstonma commented May 12, 2026

That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.

His point didn't specifically cover the DMA case, which is why I tested it. AMD APUs do have SDMA hardware, yet according to my test results the prompt processing speed improves by 40% on my machine and by 100% for another AMD APU user.

My guess is that the Vulkan async "two-step upload" path (copying from a staging buffer to a default buffer) performs worse than a direct memcpy plus a barrier on a UMA system.
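The two upload strategies being compared can be illustrated conceptually. This is a Python sketch of the idea only; the real paths live in ggml-vulkan.cpp and involve actual Vulkan staging buffers, command buffers, and barriers, and the step lists here are an assumption about what each path does:

```python
def two_step_upload_steps():
    """Steps of the async "two-step upload" path (dGPU-style design)."""
    return [
        "memcpy host data into staging buffer",
        "record copy command: staging buffer -> device buffer",
        "submit copy to transfer queue",
        "synchronize transfer queue with compute queue",
    ]

def direct_uma_upload_steps():
    """Steps of the direct path on UMA, where the GPU sees host memory."""
    return [
        "memcpy host data into shared (host-visible) buffer",
        "memory barrier before compute",
    ]

# On UMA the direct path avoids the extra copy and cross-queue sync,
# which is the overhead this PR removes.
print(len(two_step_upload_steps()), len(direct_uma_upload_steps()))  # 4 2
```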

@0cc4m
Contributor

0cc4m commented May 13, 2026

I cannot measure a difference from this on an AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/software configurations where you saw a difference in pp/tg.

@engrtipusultan

engrtipusultan commented May 13, 2026

That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.

His point didn't specifically cover the DMA case, which is why I tested it. AMD APUs do have SDMA hardware, yet according to my test results the prompt processing speed improves by 40% on my machine and by 100% for another AMD APU user.

My guess is that the Vulkan async "two-step upload" path (copying from a staging buffer to a default buffer) performs worse than a direct memcpy plus a barrier on a UMA system.

Hi, please do not quote my previous benchmark from the other PR. I shared later that it was producing gibberish even though the benchmarks showed a 100% improvement.

I merged both of your current PRs, #22930 and #22455, and there is no improvement on my hardware.

PR:

GGML_VK_DISABLE_ASYNC=1 ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 68.60 ± 1.53 |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 10.40 ± 0.01 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 113.02 ± 5.08 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 11.92 ± 0.02 |

build: 671c2f9c7 (9136)

Master:

bash  ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 68.81 ± 1.71 |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 10.73 ± 0.01 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 117.55 ± 0.94 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 11.99 ± 0.01 |

build: 046e284 (9085)

@winstonma
Contributor Author

@engrtipusultan Seeing no overall performance change is expected. You can also take a look at Four Years Of Kernel Improvement Net 37% Improvement On AMD EPYC: seeing a significant change from one or two commits would be surprising. Some PRs (e.g. #22930) need a micro-benchmark to reveal a slight performance improvement, but those PRs add up eventually. I believe that if PR #22462 is also applied, you would see a performance improvement. It is currently closed because the policy recommends new contributors submit one request at a time.

I made several edits after your comment. Could you test #22462 alongside #22455 and check whether the gibberish output persists?

@winstonma
Contributor Author

I cannot measure a difference from this on an AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/software configurations where you saw a difference in pp/tg.

How should I submit the benchmark code? Should I submit on this PR?

@0cc4m
Contributor

0cc4m commented May 13, 2026

You can upload it to a Github gist and link it here. I don't actually need much for this PR, the change is not relevant to most devices and the transfer queue code was created for AMD dGPUs, APUs are only affected incidentally. But still, I need to validate what I can.

@mandrakenet

mandrakenet commented May 13, 2026

Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it.
EDIT: with --no-mmap I got gibberish output; if there is a test I can run, just let me know 👎

@mandrakenet

Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it. EDIT: with --no-mmap I got gibberish output; if there is a test I can run, just let me know 👎

image

@winstonma
Contributor Author

winstonma commented May 14, 2026

I cannot measure a difference from this on AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/softweare configurations where you saw a difference in pp/tg.

After further testing, I can confirm your observation: merging this commit in isolation does not yield a measurable performance difference on my end either.

On the other hand, this suggests that disabling the transfer queue does not negatively affect performance.

Non-AMD UMA Testing

If possible, could you execute steps 1 through 4 on a non-AMD UMA system?

For step 5, rather than merging this commit directly, please manually modify line 5752 of ggml-vulkan.cpp to ensure async_use_transfer_queue is disabled specifically for UMA, then run the benchmark.

I think it is best to revisit this PR after looking into the gibberish output. Let's just leave this PR open for now. Thanks.

@winstonma
Contributor Author

winstonma commented May 14, 2026

Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it. EDIT: with --no-mmap I got gibberish output; if there is a test I can run, just let me know 👎

Thanks for tracking down the root cause of the gibberish output. I am still testing PR #22462 and will resubmit once it is ready.
