vulkan: avoid preferring transfer queue on AMD UMA devices #22455
winstonma wants to merge 1 commit into
Conversation
Neither your reasoning nor your AI-generated test provides any way to confirm or deny your claims. What are you trying to fix or improve, and where can it be measured in actual use?
Actually, I tested this together with another PR that optimizes reads/writes. The combined PRs improve prompt processing by 40% on my AMD UMA platform (also verified by another tester). I could not see any overall performance improvement from this PR or #22462 alone; only combining the two shows a prompt processing speedup. Conversely, if #22462 were committed first, the performance case for this PR would be easier to build. Disabling the transfer queue should benefit all UMA systems, but since I only have AMD hardware to test on, I limited the change to AMD devices. For the reasoning, please refer to GPU Memory Pools in D3D12, which discusses how a transfer queue can hurt performance on UMA systems.
I made the change based on that reasoning and submitted this PR because the results on my device agree with that blog.
That quote is about Intel integrated GPUs, and from before Intel Xe, when their iGPUs were still pretty primitive. AMD APUs do actually have SDMA hardware.
His point didn't actually say anything specific about non-DMA hardware, which is why I tested it, even though AMD APUs do have SDMA hardware. According to the test results, prompt processing speed improves by 40% on my machine and by 100% for another AMD APU user. My guess is that the two-step upload (copying from a staging buffer to a default buffer) over the Vulkan async path has lower performance than a direct memcpy plus barrier on a UMA path.
I cannot measure a difference from this on an AMD 8060S (Strix Halo) APU on Linux RADV. Please provide more information about the hardware/software configurations where you saw a difference in pp/tg.
Hi, please do not quote my previous bench from the other PR. I shared later that it was producing gibberish even though benchmarks showed a 100% improvement. I merged both of your current PRs, #22930 and #22455, and there is no improvement on my hardware.

PR:

```bash
GGML_VK_DISABLE_ASYNC=1 ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0 -r 3
```

build: 671c2f9c7 (9136)

Master:

```bash
./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0 -r 3
```

build: 046e284 (9085)
@engrtipusultan Seeing no overall performance change is expected. You can also take a look at Four Years Of Kernel Improvement Net 37% Improvement On AMD EPYC; seeing a significant change from just one or two commits would be surprising. Some PRs (e.g. #22930) need a micro-benchmark to reveal their slight performance improvement, but those improvements add up eventually. I believe that if PR #22462 is added, you would see a performance improvement. It is currently closed because the policy recommends that new contributors submit one request at a time. I made several edits after your comment; please feel free to test #22462 alongside #22455 and check whether the gibberish output still occurs.
How should I submit the benchmark code? Should I submit it on this PR?
You can upload it to a GitHub gist and link it here. I don't actually need much for this PR: the change is not relevant to most devices, and the transfer queue code was created for AMD dGPUs; APUs are only affected incidentally. But still, I need to validate what I can.
Hi, I have tested both PRs on my HX 370 (gfx1150) and I do see a significant performance gain. Here is the AI-generated report based on my tests and the steps I took to compile it.
After further testing, I have confirmed your observations: merging this specific commit in isolation does not yield a measurable performance difference on my end. On the other hand, this also suggests that disabling the transfer queue does not negatively affect performance.

Non-AMD UMA testing: if possible, could you execute steps 1 through 4 on a non-AMD UMA system? For step 5, rather than merging this commit directly, please manually modify line 5752 of ggml-vulkan.cpp so that the transfer queue is not preferred. I think it is best to revisit this PR after taking a look at the gibberish output. Just leave this PR here for now. Thanks.
Thanks for finding the root cause of the gibberish output. I am still testing PR #22462 and will resubmit once it is ready.

Overview
On discrete GPUs (dGPUs), a dedicated transfer queue is beneficial because memory is separate from the CPU, so offloading transfers improves throughput. On UMA devices, CPU and GPU share memory, so the extra queue synchronization adds overhead without benefit.
Additional information
Attached is the benchmark result from running on my device. The benchmark measures the performance impact of the transfer-queue UMA patch by comparing the two queue scheduling behaviors under isolated, repeatable conditions.
Requirements