UMA buffers prefer host-visible memory #22930
Conversation
Hi @winstonma, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Which buffers need this? What performance benefits do you see? I think this function is generally used for internal allocations that are not transferred to/from.
Force-pushed from 95fdd41 to 08319b9
After further testing, I determined that the necessary logic could be handled more effectively on the other side, so I have replaced the previous commit with a refined implementation.

Context & Investigation

While implementing zero-copy support for the UMA path (aiming for parity with the Metal backend), I developed a test script to evaluate the performance gains. Initial benchmark results were underwhelming; investigation revealed that the root cause was a memory allocation issue that prevented zero-copy from functioning correctly. This commit addresses that issue.

Benchmarks

The test measures the latency of moving data from the backend (GPU/Vulkan) back to the host (CPU) one million times on an AMD Radeon 880M Graphics. The uncommitted code change improves the speed to ~0.2 s.

Lastly, since this is not on the hot memory read/write path, pp and tg performance won't be improved.
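For context, the readback benchmark described above can be modeled with a simple timing harness. This is only an illustrative sketch: `read_tensor_back` is a hypothetical stand-in for the real GPU-to-host transfer (the actual test went through ggml's Vulkan backend), so only the measurement structure is meaningful here, not the numbers.

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Hypothetical stand-in for reading tensor data back from the backend.
// In the real benchmark this was a GPU->host transfer through ggml-vulkan;
// here it is a plain memcpy so the harness stays self-contained.
void read_tensor_back(const std::vector<float>& src, std::vector<float>& dst) {
    std::memcpy(dst.data(), src.data(), src.size() * sizeof(float));
}

// Time `iterations` readbacks of an `n_elems`-element tensor and
// return the total elapsed wall-clock time in seconds.
double benchmark_readback(size_t iterations, size_t n_elems) {
    std::vector<float> device(n_elems, 1.0f); // pretend GPU-side data
    std::vector<float> host(n_elems, 0.0f);   // host-side destination
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        read_tensor_back(device, host);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

With a staging copy on the real path, each iteration pays an extra GPU-internal copy plus synchronization; with a host-visible mapping, the readback is a direct read of shared memory, which is where the latency improvement comes from.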
Even for UMA systems, preferring device-local is still correct. It may still be "easier" for the GPU to access and go through fewer intermediate steps, depending on the architecture. Posting random unverifiable numbers is not going to help your case; we'd need something reproducible that shows which cases this improves. There are a lot of devices this would affect.
The benchmark results were posted as requested. The performance gain comes from how ggml-vulkan handles
Would you mind explaining the disadvantages/limitations of falling back to device-local memory?
Your attitude is wrong, it is not my job to disprove your claims, it is your job to prove them. A benchmark result is not worth anything to me if I cannot reproduce it, and you didn't even share your script. But you're in luck, I just found an easily measurable performance issue on Nvidia DGX Spark caused by preferring just device-local memory, and this kind of change is needed to fix it.
Co-authored-by: Ruben Ortlam <picard12@live.de>
I am sorry that I made you feel this way. I agree that I don't have access to every UMA architecture, so I am not certain, as you mentioned, how the change would affect others. I would sincerely like to see how the fallback affects other UMA architectures (e.g. a driver problem that allows the system to assign

Since you would like to reproduce the benchmark situation: should I commit the cpp file in this branch? Would the cpp file get merged? This is my concern. Sorry again.
Vulkan runs on many devices, so changes like this affect a lot of architectures. Besides your AMD APU, there are older AMD devices with different architectures, Intel Xe iGPUs, Intel UHD iGPUs, Nvidia iGPUs and even phone SoCs that this affects. That's why some care is necessary. Same as with the other PR, you can upload the benchmark code to a GitHub gist.
@rillomas That's a lot of failures in the Linux Intel Vulkan CI runner. I don't really see how this change could cause that, can you take a look? |
Overview
On UMA (Unified Memory Architecture) devices the CPU and GPU share the same physical memory. Despite this, the Vulkan backend was still allocating device-local buffers without the eHostVisible flag, which prevented the CPU from directly accessing GPU tensor data. This meant that even on hardware where a zero-copy path was physically possible, the backend was forced to go through an unnecessary staging copy whenever tensor data needed to be read back to the host (e.g. during loss evaluation or prediction readback in training). UMA zero-copy itself is not implemented in this PR.
Additional information
Requirements