
vulkan: avoid preferring transfer queue on AMD UMA devices #22455

Open
winstonma wants to merge 1 commit into ggml-org:master from winstonma:winston/vulkan-uma-transfer-queue

Conversation

@winstonma
Contributor

@winstonma winstonma commented Apr 28, 2026

Overview

On discrete GPUs (dGPUs), a dedicated transfer queue is beneficial because memory is separate from the CPU, so offloading transfers improves throughput. On UMA devices, CPU and GPU share memory, so the extra queue synchronization adds overhead without benefit.
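The decision described above can be sketched as a small heuristic. This is a hypothetical Python illustration of the idea only, not the actual ggml-vulkan C++ code; the function name and the `is_uma`/`vendor` parameters are assumptions, while the `GGML_VK_ASYNC_USE_TRANSFER_QUEUE` override is the one used in the benchmark below:

```python
import os

def use_transfer_queue(is_uma: bool, vendor: str) -> bool:
    """Decide whether to prefer a dedicated transfer queue.

    Hypothetical sketch of the PR's heuristic: on AMD UMA devices the
    CPU and GPU share memory, so the extra cross-queue synchronization
    is pure overhead and the transfer queue is skipped. An environment
    variable forces the legacy always-on behavior for benchmarking.
    """
    # Explicit override restores the old behavior (used as the
    # "forced legacy" mode in the benchmark).
    if os.environ.get("GGML_VK_ASYNC_USE_TRANSFER_QUEUE") == "1":
        return True
    # On AMD UMA devices, skip the transfer queue (this PR's change).
    if is_uma and vendor == "AMD":
        return False
    # Discrete GPUs keep the dedicated transfer queue.
    return True

print(use_transfer_queue(is_uma=True, vendor="AMD"))   # AMD APU: False
print(use_transfer_queue(is_uma=False, vendor="AMD"))  # AMD dGPU: True
```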

Additional information

Attached is the benchmark result from my device. The benchmark measures the performance impact of the transfer-queue UMA patch by comparing the two queue scheduling behaviors under isolated, repeatable conditions.

❯ ./bench-compare-queue.sh --no-build
=== Queue Synchronization Benchmark (Transfer-Queue UMA Patch) ===

This benchmark measures the overhead of Vulkan queue submission and
synchronization, which is what the transfer-queue UMA patch optimizes.
Comparison mode:
  1) Patched default behavior (auto heuristic on UMA)
  2) Forced legacy behavior (GGML_VK_ASYNC_USE_TRANSFER_QUEUE=1)
Trials per mode: 5 (alternating order)

Skipping build step (using existing build)

Running benchmarks...

Trial 1/5
  Running patched default...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon 890M Graphics (RADV STRIX1))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen AI 7 PRO 360 w/ Radeon 880M)
load_backend: failed to find ggml_backend_init in ~/Code/llama.cpp/build-bench-main/bin/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in ~/Code/llama.cpp/build-bench-main/bin/libggml-cpu.so
    Mean: 3.05 ms
  Running forced legacy...
    Mean: 3.47 ms

Trial 2/5
  Running forced legacy...
    Mean: 3.45 ms
  Running patched default...
    Mean: 3.06 ms

Trial 3/5
  Running patched default...
    Mean: 3.22 ms
  Running forced legacy...
    Mean: 3.23 ms

Trial 4/5
  Running forced legacy...
    Mean: 3.37 ms
  Running patched default...
    Mean: 3.31 ms

Trial 5/5
  Running patched default...
    Mean: 3.18 ms
  Running forced legacy...
    Mean: 3.29 ms


=== Summary ===
Patched default means:            3.05 3.06 3.22 3.31 3.18
Forced legacy means:              3.47 3.45 3.23 3.37 3.29
Patched default median:                3.18 ms
Forced legacy median:                  3.37 ms
Median delta vs forced legacy: -5.64% (patched default is faster)

Note: Lower Mean is better. This isolates the transfer-queue decision without requiring an older worktree target.
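The summary figures can be reproduced from the per-trial means printed above (plain Python, values copied from the log):

```python
from statistics import median

patched = [3.05, 3.06, 3.22, 3.31, 3.18]  # patched default means (ms)
legacy  = [3.47, 3.45, 3.23, 3.37, 3.29]  # forced legacy means (ms)

p_med, l_med = median(patched), median(legacy)
delta = (p_med - l_med) / l_med * 100  # negative: patched is faster

print(f"Patched default median: {p_med:.2f} ms")  # 3.18 ms
print(f"Forced legacy median:   {l_med:.2f} ms")  # 3.37 ms
print(f"Median delta: {delta:.2f}%")              # -5.64%
```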

Requirements

@winstonma winstonma requested a review from a team as a code owner April 28, 2026 02:32
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Apr 28, 2026
@0cc4m
Contributor

0cc4m commented May 12, 2026

Neither your reasoning nor your AI-generated test provide any way to confirm or deny your claims. What are you trying to fix/improve? Where can it be measured in actual use?

@winstonma
Contributor Author

winstonma commented May 12, 2026

Actually I tested this together with another PR that optimizes reads/writes. The combined PRs improve prompt processing by 40% on my AMD UMA platform (also verified by another tester).

I couldn't see any overall performance improvement from either this PR or #22462 alone; only combining the two shows a prompt-processing speedup.

Conversely, if #22462 is merged first, the performance case for this PR becomes easier to demonstrate.

Disabling the transfer queue should benefit all UMA systems, but I only have an AMD device to test, so I limited the change to AMD.

For the reasoning, please refer to GPU Memory Pools in D3D12, which discusses how using the transfer queue can hurt performance on UMA systems.

A UMA/integrated device often doesn’t have a dedicated DMA engine, and so any submissions to COPY (TRANSFER) queues will end up getting serialized/flattened onto a single hardware queue.

I made this change based on that reasoning and submitted this PR because the results on my device agree with his blog post.

@0cc4m
Contributor

0cc4m commented May 12, 2026

That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.

@winstonma
Contributor Author

winstonma commented May 12, 2026

That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.

His point didn't specifically cover the DMA case, which is why I tested it. AMD APUs do have SDMA hardware, yet according to my test results the prompt processing speed improves by 40% on my machine and by 100% for another AMD APU user.

My guess is that the Vulkan async "two-step upload" path (copying from a staging buffer to a default buffer) performs worse than a direct memcpy plus a barrier on a UMA system.
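The two upload strategies being compared can be illustrated conceptually. This is a Python sketch of the idea only; the real paths live in ggml-vulkan.cpp and involve actual Vulkan staging buffers, command buffers, and barriers, and the step lists here are an assumption about what each path does:

```python
def two_step_upload_steps():
    """Steps of the async "two-step upload" path (dGPU-style design)."""
    return [
        "memcpy host data into staging buffer",
        "record copy command: staging buffer -> device buffer",
        "submit copy to transfer queue",
        "synchronize transfer queue with compute queue",
    ]

def direct_uma_upload_steps():
    """Steps of the direct path on UMA, where the GPU sees host memory."""
    return [
        "memcpy host data into shared (host-visible) buffer",
        "memory barrier before compute",
    ]

# On UMA the direct path avoids the extra copy and cross-queue sync,
# which is the overhead this PR removes.
print(len(two_step_upload_steps()), len(direct_uma_upload_steps()))  # 4 2
```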

@0cc4m
Contributor

0cc4m commented May 13, 2026

I cannot measure a difference from this on an AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/software configurations where you saw a difference in pp/tg.

@engrtipusultan

engrtipusultan commented May 13, 2026

That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.

His point didn't specifically cover the DMA case, which is why I tested it. AMD APUs do have SDMA hardware, yet according to my test results the prompt processing speed improves by 40% on my machine and by 100% for another AMD APU user.

My guess is that the Vulkan async "two-step upload" path (copying from a staging buffer to a default buffer) performs worse than a direct memcpy plus a barrier on a UMA system.

Hi, please do not quote my previous benchmark from the other PR. I shared later that it was producing gibberish even though the benchmarks showed a 100% improvement.

I merged both of your current PRs, #22930 and #22455, and there is no improvement on my hardware.

PR:

GGML_VK_DISABLE_ASYNC=1 ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 68.60 ± 1.53 |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 10.40 ± 0.01 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 113.02 ± 5.08 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 11.92 ± 0.02 |

build: 671c2f9c7 (9136)

Master:

bash  ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 68.81 ± 1.71 |
| qwen3next 80B.A3B Q5_K - Small | 51.98 GiB | 79.67 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 10.73 ± 0.01 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | pp512 | 117.55 ± 0.94 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan,BLAS | 8 | 1088 | 1 | 0 | tg128 | 11.99 ± 0.01 |

build: 046e284 (9085)

@winstonma
Contributor Author

@engrtipusultan Seeing no overall performance change is expected. You can also take a look at Four Years Of Kernel Improvement Net 37% Improvement On AMD EPYC: seeing a significant change from one or two commits would be surprising. Some PRs (e.g. #22930) need a micro-benchmark to reveal a slight performance improvement, but those PRs add up eventually. I believe that if PR #22462 is also applied, you would see a performance improvement. It is currently closed because the policy recommends new contributors submit one request at a time.

I made several edits after your comment. Could you test #22462 alongside #22455 and check whether the gibberish output persists?

@winstonma
Contributor Author

I cannot measure a difference from this on an AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/software configurations where you saw a difference in pp/tg.

How should I submit the benchmark code? Should I submit on this PR?

@0cc4m
Contributor

0cc4m commented May 13, 2026

You can upload it to a Github gist and link it here. I don't actually need much for this PR, the change is not relevant to most devices and the transfer queue code was created for AMD dGPUs, APUs are only affected incidentally. But still, I need to validate what I can.

@mandrakenet

mandrakenet commented May 13, 2026

Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it.
EDIT: with --no-mmap I got gibberish output; if there is a test I can run, just let me know 👎

@mandrakenet

Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it. EDIT: with --no-mmap I got gibberish output; if there is a test I can run, just let me know 👎

image

@winstonma
Contributor Author

winstonma commented May 14, 2026

I cannot measure a difference from this on AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/softweare configurations where you saw a difference in pp/tg.

After further testing, I can confirm your observation: merging this commit in isolation does not yield a measurable performance difference on my end either.

On the other hand, this suggests that disabling the transfer queue does not negatively affect performance.

Non-AMD UMA Testing

If possible, could you execute steps 1 through 4 on a non-AMD UMA system?

For step 5, rather than merging this commit directly, please manually modify line 5752 of ggml-vulkan.cpp to ensure async_use_transfer_queue is disabled specifically for UMA, then run the benchmark.

I think it is best to revisit this PR after looking into the gibberish output. Let's just leave this PR open for now. Thanks.

@winstonma
Contributor Author

winstonma commented May 14, 2026

Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it. EDIT: with --no-mmap I got gibberish output; if there is a test I can run, just let me know 👎

Thanks for tracking down the root cause of the gibberish output. I am still testing PR #22462 and will resubmit once it is ready.
