Optimize Vulkan buffer transfers on UMA (Unified Memory Architecture) devices by winstonma · Pull Request #22462 · ggml-org/llama.cpp

winstonma · 2026-04-28T08:48:10Z

Overview

This PR optimizes Vulkan buffer transfers on UMA (Unified Memory Architecture) devices by bypassing GPU staging buffers when possible and using direct CPU memory access instead. The changes target situations where GPU and CPU memory are physically the same, making direct copies more efficient.

Additional information

This is the ran benchmark result:

Write Transfer Time Test (`ggml_backend_tensor_set`):

Size	CPU	GPU	Winner
64 MB	2874.80	4482.24	CPU
128 MB	5631.16	8411.59	CPU
256 MB	11179.18	17056.37	CPU
1 GB	43660.37	65976.04	CPU
2 GB	87184.07	132112.14	CPU

Because the CPU outperforms the GPU even during large transfers, the CPU write path has replaced the GPU write path for UMA configurations.

Read Transfer Time Test (`ggml_backend_tensor_get`):

Size	CPU	GPU	Winner
2 KB	9.17	73.84	CPU
4 KB	18.88	77.11	CPU
8 KB	38.04	88.56	CPU
16 KB	76.53	108.02	CPU
32 KB	153.22	123.70	GPU
64 KB	333.40	121.90	GPU
128 KB	671.40	135.83	GPU
256 KB	1329.77	165.87	GPU
1024 KB	4973.79	289.40	GPU

To optimize read performance, a calibration function identifies the specific crossover size for read transfers. This process, applicable only to the UMA (Uniform Memory Access) path, requires an additional 100ms to execute. Consequently, the default read threshold has been established at 16KB.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, for finding/implementing/bench-marking UMA optimization

… devices

ggml-gh-bot · 2026-04-28T08:53:47Z

Hi @winstonma, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

winstonma · 2026-04-28T09:50:50Z

So how can I get this PR reviewed? Thanks

engrtipusultan · 2026-04-28T10:57:56Z

I have ryzen-7-5825u with Vega 8. I am seeing almost 200% packet processing increase. Thank you very much.

Master with #22462 #22455 and #21751 merged.

bash  ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

model	size	params	backend	threads	n_ubatch	fa	test	t/s
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	pp512	141.67 ± 0.03
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	tg128	10.72 ± 0.00
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	pp512 @ d8096	105.16 ± 0.05
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	tg128 @ d8096	10.17 ± 0.01
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	pp512	200.42 ± 0.83
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	tg128	11.98 ± 0.03
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	pp512 @ d8096	142.51 ± 0.13
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	tg128 @ d8096	11.47 ± 0.00

build: 4e522bfe4 (8961)

Original Master:

bash  ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

model	size	params	backend	threads	n_ubatch	fa	test	t/s
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	pp512	70.44 ± 1.19
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	tg128	10.78 ± 0.01
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	pp512 @ d8096	60.12 ± 0.65
qwen3next 80B.A3B Q5_K - Small	51.98 GiB	79.67 B	Vulkan,BLAS	8	1088	1	tg128 @ d8096	10.22 ± 0.00
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	pp512	120.22 ± 2.95
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	tg128	12.01 ± 0.02
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	pp512 @ d8096	90.34 ± 2.31
qwen35moe 35B.A3B Q8_0	34.36 GiB	34.66 B	Vulkan,BLAS	8	1088	1	tg128 @ d8096	11.37 ± 0.04

build: b1a5bd4 (8938)

engrtipusultan · 2026-04-28T11:12:20Z

I was too quick to get excited. Benchmarks are wild but output is gibberish on all models. Reverting.

winstonma · 2026-04-28T11:24:50Z

Okie I will take a look

I just ran llama-bench and didn't ran llama-cli to check the output

winstonma · 2026-04-28T13:12:58Z

@engrtipusultan I ran the LLM model but I couldn't repeat what you saw.

The LLM output is fine
The llama-bench result on my machine is more or less the same

Did you see any good result after reverting only this commit?

Here is the llama-bench result on my machine:

Using version 8966:

❯ llama-bench -m ~/model/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        343.84 ± 1.55 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         20.88 ± 0.09 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |        280.71 ± 0.91 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         18.56 ± 0.03 |

build: 7b8443ac7 (8966)

With this PR:

❯ llama-bench -m ~/model/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        342.38 ± 2.64 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         21.20 ± 0.04 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |        283.93 ± 1.25 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         18.07 ± 0.52 |

build: 7b8443ac7 (8966)

engrtipusultan · 2026-04-28T14:00:51Z

Yes reverting to latest master resolves the issue. So it is one of your two PRs that caused it. I checked on llama-server like shown in screenshots.

engrtipusultan · 2026-04-28T14:02:53Z

If you want, tomorrow, I can check both PRs one by one

Adds a configurable threshold via env var: GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD (default now 512 * 1024). Introduces ggml_vk_uma_non_cached_direct_read_threshold() to parse/cache that env var once, with validation and warning logs on invalid/overflow values. Introduces ggml_vk_use_uma_direct_read(vk_buffer &, size_t) to centralize the direct-read decision logic. Replaces duplicated inline heuristics in three read paths with the shared helper: - ggml_vk_buffer_read_2d_async() - ggml_vk_buffer_read() - ggml_backend_vk_get_tensor_async() Keeps the small non-cached UMA async behavior explicit: if direct read is not preferred and sync staging is unavailable, it returns false so caller falls back. Adds needed headers for parsing/error handling: <cstdlib> and <cerrno>.

winstonma · 2026-04-28T17:11:59Z

@engrtipusultan I just updated the PR code. Could you please see if it break on your side?

From the performance perspective I don't see a huge difference on the pp and tg performance. I would consider this as a micro-optimization for UMA device.

arch-btw · 2026-04-28T18:23:22Z

This PR (#22462)

./llama-bench -m google_gemma-4-26B-A4B-it-Q4_K_L.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        170.95 ± 0.88 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         20.60 ± 0.04 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |         75.79 ± 0.32 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         16.72 ± 0.00 |

Current master (commit 5d56eff)

./llama-bench -m google_gemma-4-26B-A4B-it-Q4_K_L.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        176.07 ± 1.37 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         22.33 ± 0.05 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |         76.54 ± 0.34 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         18.19 ± 0.08 |

winstonma · 2026-04-28T23:18:07Z

@arch-btw Could you try llama-cli and see if the output is gibberish? Thanks

Also if you feel free could you also test my other PR #22455 alongside with this?

arch-btw · 2026-04-29T00:02:50Z

@winstonma the output is good:

> What is llama.cpp?

[Start thinking]
Here's a thinking process that leads to the explanation of llama.cpp:

1.  **Analyze the Request:**
    *   **Question:** "What is llama.cpp?"
    *   **Intent:** The user wants a clear, concise, and informative overview of the software library known as `llama.cpp`.
    *   **Target Audience:** Likely someone interested in AI, LLMs (Large Language Models), open-source software, or local AI deployment. The explanation should cover technical aspects without being overly dense, and highlight *why* it matters.

etc.

I will try the other commit next! Thank you for these PRs.

winstonma · 2026-04-29T00:30:56Z

Good to hear the results. Actually I start seeing benchmark improvement only when both PR are merged together.

Apart front these two commits AI also identified serveral smaller optimization for UMA vulkan path so I will implement, test and create PR if benchmark show promising result.

jeffbolznv · 2026-04-29T03:33:32Z

+    if (dst->device->uma && (dst->memory_property_flags & vk::MemoryPropertyFlagBits::eHostVisible)) {
+        GGML_ASSERT(dst->memory_property_flags & vk::MemoryPropertyFlagBits::eHostCoherent);
+        if (width == spitch) {
+            deferred_memcpy((uint8_t *) dst->ptr + offset, src, width * height, &subctx->in_memcpys);


I don't think this is correct for the same reasons I commented in #20018. The async copies need to run on the queue to stay in order with other commands.

Thanks for the review. I am not familiar with these. I asked Codex to write a test case to verify the the async copies and passes the test case. And here is the follow up question that I asked:

Yes, the code is implemented to stay ordered with other backend work.

In the UMA host-visible branch at if (dst->device->uma && (dst->memory_property_flags & vk::MemoryPropertyFlagBits::eHostVisible)), the copy is not executed immediately. It is queued via deferred_memcpy into subctx in_memcpys.

Those queued host writes are flushed only when the context is submitted, in ggml_vk_run_deferred_uploads and ggml_vk_submit_transfer_ctx.

For compute-path submission, deferred uploads are run right before submit in ggml_vk_run_deferred_uploads(compute_ctx);. For transfer-path submission, same behavior is in ggml_vk_run_deferred_uploads(cpy_ctx);.

The async tensor API routes into this path from ggml_backend_vk_set_tensor_async, so these copies participate in the same submission/sync chain as other backend commands.

If transfer queue is enabled, cross-queue ordering is linked by timeline semaphore signal/wait in ctx->transfer_semaphore.value++;, and result->s->wait_semaphores.push_back(ctx->transfer_semaphore);.

So for the code specifically, ordering is preserved because writes are deferred and then flushed at queue-submit boundaries, not applied out-of-band.

You need to be familiar with it. Copy-pasting AI responses into maintainer questions is not allowed because we do not have time or patience to debate an AI that can make up wrong claims way faster than any human could debunk them.

Frankly I'm not quite sure I follow the question. But I tried to add some log and see if the question is being answered. This is the debug log:

❯ ./build-vk-debug/bin/llama-cli -m ~/model/gemma-4-E4B-it-UD-Q4_K_XL.gguf -p "Hello" -n 16 2>&1 | grep VK_TIMELINE_HANDSHAKE Loading model... |VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=1 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=1 last_waited=0 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=2 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=2 last_waited=1 source=ggml_vk_synchronize \VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=3 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=3 last_waited=2 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=4 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=4 last_waited=3 source=ggml_vk_synchronize ▄▄ ▄▄ ██ ██ ██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄ ██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██ ██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀ ██ ██ ▀▀ ▀▀ build : b8960-fe1eb0302 model : gemma-4-E4B-it-UD-Q4_K_XL.gguf modalities : text available commands: /exit or Ctrl+C stop or exit /regen regenerate the last response /clear clear the chat history /read <file> add a text file /glob <pattern> add text files using globbing pattern > Hello |VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=5 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=5 last_waited=4 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=6 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=6 last_waited=5 source=ggml_vk_synchronize -VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=7 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=7 last_waited=6 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=8 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=8 last_waited=7 source=ggml_vk_synchronize HelloVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=9 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=9 last_waited=8 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=10 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=10 last_waited=9 source=ggml_vk_synchronize !VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=11 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=11 last_waited=10 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=12 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=12 last_waited=11 source=ggml_vk_synchronize HowVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=13 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=13 last_waited=12 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=14 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=14 last_waited=13 source=ggml_vk_synchronize canVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=15 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=15 last_waited=14 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=16 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=16 last_waited=15 source=ggml_vk_synchronize IVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=17 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=17 last_waited=16 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=18 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=18 last_waited=17 source=ggml_vk_synchronize helpVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=19 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=19 last_waited=18 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=20 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=20 last_waited=19 source=ggml_vk_synchronize youVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=21 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=21 last_waited=20 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=22 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=22 last_waited=21 source=ggml_vk_synchronize todayVK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=23 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=23 last_waited=22 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=24 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=24 last_waited=23 source=ggml_vk_synchronize ?VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=25 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=25 last_waited=24 source=ggml_vk_synchronize VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=26 source=ggml_vk_submit_transfer_ctx VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=26 last_waited=25 source=ggml_vk_synchronize [ Prompt: 71.1 t/s | Generation: 18.2 t/s ]

According to the log, the Vulkan Timeline Semaphore have created a system where the Compute Queue is physically incapable of outrunning the data being moved by the Transfer Queue. Thus the ordering is maintained. Also, the Compute Queue is hardware-blocked (bound by a Vulkan Timeline Semaphore wait operation) until the Transfer Queue signals completion, there is no risk of the GPU reading "stale" or partially written memory.

Disabling Transfer Queue on AMD UMA

I also submitted another PR to disable to the transfer queue on the AMD UMA. If the Transfer Queue is disabled, the code path would naturally fall back to a single-queue model where all operations are submitted to the Compute Queue. In this scenario, ordering is maintained by default due to the sequential nature of command submission within a single Vulkan queue.

Regardless of the transfer queue or compute queue, ordering is maintained for commands you submit to the queue. That does not apply to deferred memcpys. in_memcpys run on queue submission. out_memcpys run (in specific cases) after a fence wait that makes sure all queue commands are done. This will not work with the backend async read/write functions because those assume that the commands run in the right order in the queue.

It may work in your tests because you get lucky and the order works out, but this is not guaranteed. This change is fundamentally unsafe.

Thanks for detailed explanation. I made a commit based on previous comment.

The commit moves the execution of out_memcpys is deferred from ggml_vk_compute_forward to ggml_vk_synchronize (ensures that the memcpy only occurs after the GPU fence has signaled completion). Also prevents dropped copies that might occur if a tensor's weak context reference was unset before the synchronization happened. Thus the ordering is enforced by the fence.

in_memcpys is consumed before GPU work submission, also it is cleared after submission (in ggml_vk_submit_transfer_ctx) instead of being cleared later during synchronization (avoid it from being re-executed). Thus ensure transfer complete before GPU work begins.

Conclusion

in_memcpys is executed before submit

In the UMA write path ggml_vk_buffer_write_2d_async, instead of going through a staging buffer + vkCmdCopyBuffer, the data is deferred into subctx->in_memcpys. These are then memcpy'd before ggml_vk_submit is called, at two sites:

ggml_vk_compute_forward : loops in_memcpys → memcpy → then submits the command buffer.

ggml_vk_synchronize: same pattern for any remaining in_memcpys on the compute context.

ggml_vk_submit_transfer_ctx: same for the transfer queue context.

out_memcpys executed after fence

In the UMA read path ggml_vk_buffer_read_2d_async, the read is deferred into subctx->out_memcpys. These are consumed only after waitForFences succeeds in ggml_vk_synchronize

fence signals GPU done → loop ctx->gc.contexts → for each tensor_ctx->out_memcpys: memcpy(dst, src, n)

… be completed or discarded at the wrong time instead of becoming visible only after the relevant GPU work has finished

…hitecture) memory transfer thresholds - Automatic Calibration: Added ggml_vk_calibrate_uma_thresholds, which benchmarks memcpy (CPU) against vkCmdCopyBuffer (GPU) for sizes ranging from 4KB to 4MB. It identifies the "crossover point" where GPU copies become more efficient than direct CPU access. - Per-Device Thresholds: Added uma_read_threshold and uma_write_threshold to the vk_device_struct to store these calibrated values for each device. - Initialization: The calibration process is now triggered during ggml_vk_init. - Refined Threshold Logic: - Updated ggml_vk_uma_non_cached_direct_read_threshold and ggml_vk_uma_non_cached_direct_write_threshold to accept a vk_device reference. - Implemented a caching mechanism for environment variable overrides (GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD and GGML_VK_UMA_NON_CACHED_DIRECT_WRITE_THRESHOLD). - The system now prioritizes environment variables; if none are provided, it falls back to the calibrated value, and finally to the hardcoded default. - Call-site Updates: Updated ggml_vk_use_uma_direct_read and ggml_vk_use_uma_direct_write to pass the device context to the threshold lookup functions.

- Centralized Threshold Parsing The code replaces duplicated logic for parsing environment variables (like GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD) with a single template function: ggml_vk_parse_uma_threshold<DEFAULT_VALUE>(env_var_name): This centralizes the logic for checking the environment variable, parsing the value, and handling errors or overflows. It simplifies ggml_vk_uma_non_cached_direct_read_threshold and its write counterpart. - Refactored Benchmarking Logic The UMA calibration process has been decomposed into smaller, more reusable helpers: ggml_vk_benchmark_iterations: A generic helper to average the execution time of an operation over a set number of iterations. ggml_vk_benchmark_uma_threshold: A generic function that iterates through a list of sizes to find the break-even point where a GPU copy becomes faster than a direct CPU memcpy on UMA. ggml_vk_run_uma_benchmarks: Extracts the actual benchmarking setup (buffer creation, warmup, and execution) from the calibration entry point. ggml_vk_calibrate_uma_thresholds: Now serves as a clean entry point that simply triggers the benchmarks if the device is UMA. - Unified UMA Transfer Decisions The logic for deciding whether to use a direct host-mapped transfer instead of a GPU copy is now encapsulated in two new helpers: ggml_vk_is_uma_host_visible: Checks if a buffer is host-visible on a UMA device. ggml_vk_should_use_uma_direct_transfer: Combines the visibility check with the threshold check (read vs. write). Usage: These helpers are now used throughout the file (e.g., in ggml_vk_buffer_read, ggml_vk_buffer_write_2d_async, and ggml_backend_vk_set/get_tensor_async), replacing repetitive inline checks. - 2D Copy Cleanup ggml_vk_deferred_memcpy_2d: A new helper that handles 2D memory copies (considering strides/pitches). This removes duplicated loops in both the 2D read and write async functions.

…e memory is not host-cached In UMA systems, this usually reflects the trade-off between the overhead of orchestrating a DMA transfer (via staging buffers) versus the slower raw access speed of non-cached memory. Typically, direct access is more efficient for small buffers where the DMA setup overhead dominates, while DMA/staging is preferred for large buffers to leverage higher throughput. This change aligns the code with that typical performance characteristic.

…ama.cpp into winston/vk-uma-read-threshold

…selection on UMA devices The benchmark is used to calibrate uma_read_threshold and uma_write_threshold, which determine when to prefer direct host memory access over GPU transfers on UMA devices. By measuring only the pure GPU copy time, the thresholds were biased—they made GPU transfers appear faster than they actually are in production code paths that use staging buffers. With these changes, the benchmark now accurately reflects the total cost of the GPU transfer path, including the necessary memcpy overhead. This produces more realistic threshold values that won't incorrectly favor GPU transfers when direct host access would actually be faster.

The dead code returns false (fall back) for all non-UMA-direct cases when sync_staging is false, including the pinned-memory path below it. The pinned memory path doesn't need a staging buffer, so this early return incorrectly skips it.

…_vk_buffer_memset

winstonma · 2026-05-04T09:39:55Z

Thank you @jeffbolznv and @0cc4m for the detailed feedback on ordering correctness.

The PR is reworked the approach to address the core concern: the deferred memcpys now have proper Vulkan memory dependency coverage rather than relying on timing alone.

Specifically, I added two functions — ggml_vk_record_host_write_barrier and ggml_vk_record_host_read_barrier — that record explicit pipelineBarrier calls into the command buffer:

Write path: HOST_WRITE → SHADER_READ/TRANSFER_READ barrier is recorded into the command buffer after the CPU memcpys from in_memcpys have completed, just before vkQueueSubmit. This ensures the GPU sees the host writes as formally visible.
Read path: SHADER_WRITE/TRANSFER_WRITE → HOST_READ barrier is recorded into the command buffer at ctx_end time (before submit), so by the time waitForFences returns and out_memcpys are drained, the GPU writes are formally visible to the host.

…d to ensure GPU synchronization before reading

1. prevents host_data (the destination) from accumulating warmth across CPU iterations before GPU gets measured 2. correctness fix for the write path by returning the largest size tested if CPU win all size

1. Unify the read/write carlibration size, also use binary search to replace linear search to reduce the carlibration time 2. Refactor ggml_vk_record_host_write_barrier and ggml_vk_record_host_read_barrier

1. Based on the testing, the memcpy write is always faster than vulkan write, so the write threshold benchmarking is being removed. Also the write path is simplified 2. Set the default read threshold to 16KB

winstonma added 2 commits April 28, 2026 10:19

vulkan: avoid preferring transfer queue on AMD UMA devices

80e793a

Optimize Vulkan buffer transfers on UMA (Unified Memory Architecture)…

da5e315

… devices

winstonma requested a review from a team as a code owner April 28, 2026 08:48

github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Apr 28, 2026

winstonma force-pushed the winston/vk-uma-read-threshold branch from e95b92d to da5e315 Compare April 28, 2026 13:43

jeffbolznv reviewed Apr 29, 2026

View reviewed changes

winstonma added 9 commits May 1, 2026 15:16

fix incorrect async/event ordering on Vulkan, where a host read could…

bd5db36

… be completed or discarded at the wrong time instead of becoming visible only after the relevant GPU work has finished

implement UMA write threshold to avoid non-cached memory penalty

91176d3

Merge commit 'refs/pull/22455/head' of https://github.com/ggml-org/ll…

4ecad4f

…ama.cpp into winston/vk-uma-read-threshold

Fixed the indentation inconsistency

9309d72

winstonma added 7 commits May 4, 2026 13:16

added UMA write thresholding for ggml_vk_buffer_memset_async and ggml…

bd7701b

…_vk_buffer_memset

revert prefers_transfer_queue definition and comments

6f85dc0

use read/write barrier to address potential race conditions

8f2fb72

cleanup and deduplication

d38def3

cleanup and deduplication ggml-org#2

e7db9e1

follows best practices for handling platform-specific size differences

630716e

added a two-line comment explaining the contract

5c78cdd

winstonma added 8 commits May 4, 2026 19:23

Revert "fix incorrect async/event ordering on Vulkan"

ab18b5a

making the barrier a silent no-op

b58976a

removed the premature UMA direct transfer check in ggml_vk_buffer_rea…

bd0a0ff

…d to ensure GPU synchronization before reading

fixing slow read speed

2aad038

fix calibration

c52584d

1. prevents host_data (the destination) from accumulating warmth across CPU iterations before GPU gets measured 2. correctness fix for the write path by returning the largest size tested if CPU win all size

remove flush cache

e176a81

refactor and optimize the calibration process

8c67e77

1. Unify the read/write carlibration size, also use binary search to replace linear search to reduce the carlibration time 2. Refactor ggml_vk_record_host_write_barrier and ggml_vk_record_host_read_barrier

adjusted the read/write logic

0ae5d6a

1. Based on the testing, the memcpy write is always faster than vulkan write, so the write threshold benchmarking is being removed. Also the write path is simplified 2. Set the default read threshold to 16KB

winstonma closed this May 11, 2026

winstonma mentioned this pull request May 12, 2026

vulkan: avoid preferring transfer queue on AMD UMA devices #22455

Open

ElSnacko mentioned this pull request May 16, 2026

Speculative decoding with draft model: extreme slowdown on Vulkan (UMA iGPU) when both models on same device #23126

Open

Conversation

winstonma commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Write Transfer Time Test (ggml_backend_tensor_set):

Read Transfer Time Test (ggml_backend_tensor_get):

Requirements

Uh oh!

ggml-gh-bot Bot commented Apr 28, 2026

Uh oh!

winstonma commented Apr 28, 2026

Uh oh!

engrtipusultan commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

engrtipusultan commented Apr 28, 2026

Uh oh!

winstonma commented Apr 28, 2026

Uh oh!

winstonma commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

engrtipusultan commented Apr 28, 2026

Uh oh!

engrtipusultan commented Apr 28, 2026

Uh oh!

winstonma commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arch-btw commented Apr 28, 2026

Uh oh!

winstonma commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arch-btw commented Apr 29, 2026

Uh oh!

winstonma commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffbolznv Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

winstonma Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

0cc4m Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

winstonma Apr 30, 2026

Choose a reason for hiding this comment

Disabling Transfer Queue on AMD UMA

Uh oh!

0cc4m Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

winstonma May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Conclusion

in_memcpys is executed before submit

out_memcpys executed after fence

Uh oh!

winstonma commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

winstonma commented Apr 28, 2026 •

edited

Loading

Write Transfer Time Test (`ggml_backend_tensor_set`):

Read Transfer Time Test (`ggml_backend_tensor_get`):

engrtipusultan commented Apr 28, 2026 •

edited

Loading

winstonma commented Apr 28, 2026 •

edited

Loading

winstonma commented Apr 28, 2026 •

edited

Loading

winstonma commented Apr 28, 2026 •

edited

Loading

winstonma commented Apr 29, 2026 •

edited

Loading

winstonma Apr 29, 2026 •

edited

Loading

winstonma May 1, 2026 •

edited

Loading

`in_memcpys` is executed before submit

`out_memcpys` executed after fence