common: improve --fit host-memory accounting for CPU and iGPU#22922
fl0rianr wants to merge 3 commits
Conversation
Add a second post-load refinement pass for --fit on CPU-only and single-device UMA/iGPU paths. The initial fit pass runs before the real model load and can overestimate the remaining host-memory budget once model weights, context buffers, compute buffers, and page cache compete for the same physical memory. Record a small context-memory profile during the existing no_alloc fit probe, then after model load re-check the shared host-memory budget and reduce n_ctx when required. Linux uses a host-pressure estimate based on free memory, reclaimable file/slab cache, and zone high watermarks, with optional cgroup v1/v2 limit clamping when available. Tested on Linux. Signed-off-by: Florian Reinle <f.reinle@otec.de>
Hi @JohannesGaessler, I opened this as a draft because the code is functional in my tested paths, but the PR touches OS-specific behavior: Linux is tested more thoroughly, while Windows/macOS are best-effort for now. Additional testing from other users in the community would be very helpful, so I think discussion on scope and approach is appropriate before marking it ready for review or merge.
If I understand the intent behind this PR correctly it is the wrong approach. As of right now there is no proper implementation of "free" host memory in
Alright, I can separate the changes; those regarding host memory should be in
Move the host free-memory calculation into ggml_backend_cpu_device_get_memory() so common/fit.cpp only consumes backend-reported memory values. Keep the tested --fit post-load context adjustment for CPU-only and single-iGPU host-backed paths, while making the fit changes smaller and easier to review. Signed-off-by: Florian Reinle <f.reinle@otec.de>
Done. I suppose there are some hard discussions ahead... It's a very crucial topic from my point of view. Thanks again.
Maybe I should have been clearer about this: the changes to the CPU backend should be their own PR since they can be reviewed and merged separately. In particular, I am not an expert regarding how to best estimate free host memory.
True, that is the correct approach; I was very much focused on fixing --fit. I opened the CPU backend memory change as a separate PR: #22939. I will remove the ggml CPU memory changes from this PR in the next commit, so this PR stays focused purely on --fit.

Before potentially simplifying it further, I would like to understand the preferred direction for fit. My current local attempt was to avoid running fit twice, once for weights and once for context. That has not worked cleanly so far. In light of that, the two-step approach may actually be the better direction for the future: let the load path expose separate weight/context memory estimates to the caller, so --fit can make more precise override decisions beforehand. That could also help potential follow-up work where external callers load weights and context more explicitly/separately, similar to what llama.cpp already does internally. I am happy to change direction; I just want to avoid polishing the wrong approach.
If there is a proper estimate for free host memory, simply use that to determine the context size reduction (with consideration for not double-counting unified memory). After that basically no changes should be needed as the algorithm will try to fill the GPU up first and push everything else to host memory. |
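The suggested flow can be sketched roughly as follows. This is a minimal sketch under assumptions: `fit_split`, `split_weights`, and the byte-granular placement are illustrative simplifications, not llama.cpp's actual tensor-level fitting logic.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the algorithm described above: fill the GPU budget first, push
// the remainder to host memory, and let whatever host memory is left over
// constrain the context size. All names are illustrative.
struct fit_split {
    uint64_t gpu_bytes;   // weights placed in device memory
    uint64_t host_bytes;  // overflow placed in host memory
    uint64_t host_left;   // remaining host budget, constrains n_ctx
};

static fit_split split_weights(uint64_t weight_bytes, uint64_t gpu_free, uint64_t host_free) {
    const uint64_t on_gpu  = std::min(weight_bytes, gpu_free);
    const uint64_t on_host = weight_bytes - on_gpu;  // everything else goes to host
    const uint64_t left    = on_host > host_free ? 0 : host_free - on_host;
    return { on_gpu, on_host, left };
}
```

With a proper free-host-memory estimate, `host_left` is then the only number needed to decide the context-size reduction.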
Account for non-mmap single-iGPU shared host/device memory pressure during fit estimation without relying on model post-load information. Improve CPU-only fit behavior by including CPU repack buffers in the host-memory breakdown. Changes reduced to a minimum. Signed-off-by: Florian Reinle <f.reinle@otec.de>

Overview
This PR improves `--fit` by adding host-backed memory accounting for cases where host memory is the limiting budget.

The main affected paths are:

- CPU-only
- single-device `GGML_BACKEND_DEVICE_TYPE_IGPU` (UMA/iGPU)

The change keeps the existing pre-load fitting logic, but records a context-memory profile and performs an optional post-load context adjustment once model weights have been loaded and the actual remaining host budget is known.
Problem
On UMA systems, GPU memory and host memory are backed by the same physical RAM.
For example, a system with 128 GiB RAM may allow the iGPU to use 64 GiB, 96 GiB, or even more.
For testing this PR, 120 GB was allotted to the gfx1151 iGPU.
If the OS and other processes already use a large part of system memory, loading a large model can still trigger hard OOM kills even though the device-reported memory budget appears sufficient.
The current fit implementation cannot prevent these OOM events, since host RAM and device VRAM are budgeted independently and shared memory is not taken into account. For integrated GPUs this overestimates the amount of memory that is physically available, and the host budget is generally not handled correctly.
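As a minimal illustration of the double-counting, assuming hypothetical budget values (the struct and function below are not part of the PR):

```cpp
#include <algorithm>
#include <cstdint>

// Illustration of the double-counting problem described above. On a UMA
// system the device-reported budget and the host budget are the same
// physical RAM, so summing them (the current independent treatment)
// overestimates what is actually available; taking the minimum is the
// conservative choice. Names are illustrative only.
struct mem_budget {
    uint64_t device_free;  // free memory reported by the (i)GPU backend
    uint64_t host_free;    // free host memory
    bool     unified;      // true on UMA/iGPU systems
};

static uint64_t total_fit_budget(const mem_budget & m) {
    if (m.unified) {
        return std::min(m.device_free, m.host_free);  // one shared pool
    }
    return m.device_free + m.host_free;  // discrete GPU: independent pools
}
```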
Approach
This PR adds a post-load host-memory pressure pass:

- record a context-memory profile during the existing `no_alloc` fit probe
- after model load, re-check the remaining host-memory budget
- reduce `n_ctx` before `llama_init_from_model()` if needed

The pass is intentionally limited to host-backed paths:

- CPU-only
- single-device `GGML_BACKEND_DEVICE_TYPE_IGPU`

For non-iGPU GPU paths, the existing, functionally correct per-device fitting logic remains the main behavior.
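The shape of this post-load pass can be sketched as a small budget check. This is a sketch under assumptions: the names are illustrative, and context memory is modeled as linear in `n_ctx`, which the recorded profile need not be in reality.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch of the post-load adjustment: once the real remaining
// host budget is known, shrink n_ctx so the context buffers still fit.
// Assumes context memory ~= fixed_bytes + n_ctx * per_token_bytes, which is
// an assumption of this sketch, not the PR's exact cost model.
struct ctx_profile {
    uint64_t fixed_bytes;      // context/compute buffers independent of n_ctx
    uint64_t per_token_bytes;  // KV-cache cost per context token
};

static uint32_t post_load_adjust_n_ctx(const ctx_profile & p,
                                       uint64_t host_budget,  // re-checked after model load
                                       uint32_t n_ctx,
                                       uint32_t n_ctx_min) {
    const uint64_t need = p.fixed_bytes + (uint64_t) n_ctx * p.per_token_bytes;
    if (need <= host_budget) {
        return n_ctx;  // requested context already fits
    }
    if (host_budget <= p.fixed_bytes) {
        return n_ctx_min;  // nothing left for the KV cache, clamp to minimum
    }
    const uint64_t fit = (host_budget - p.fixed_bytes) / p.per_token_bytes;
    return std::max((uint32_t) std::min<uint64_t>(fit, n_ctx), n_ctx_min);
}
```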
OS branching
On Linux, the host budget is estimated from memory that can be used without relying on swap:

- free memory
- reclaimable file/slab cache
- minus kernel zone high watermarks

The Linux path also considers cgroup v1/v2 memory limits, so Docker / container cases are handled.
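A testable sketch of such an estimate is shown below. The `/proc/meminfo` field names are real, but the 50% discount on reclaimable cache and all function names are assumptions of this sketch, not the PR's actual code; parsing is factored out so the policy can be exercised on a captured snapshot.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>

// Return the value in KiB of a /proc/meminfo field such as "MemFree".
static uint64_t meminfo_field_kib(const std::string & meminfo, const std::string & key) {
    std::istringstream in(meminfo);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size() + 1, key + ":") == 0) {
            return std::stoull(line.substr(key.size() + 1));
        }
    }
    return 0;
}

// Host budget: free memory plus a discounted share of reclaimable cache,
// minus zone high watermarks, clamped to an optional cgroup limit.
static uint64_t host_budget_bytes(const std::string & meminfo,
                                  uint64_t watermark_high_kib,
                                  uint64_t cgroup_limit_bytes /* 0 = no limit */) {
    const uint64_t free_kib = meminfo_field_kib(meminfo, "MemFree");
    const uint64_t recl_kib = meminfo_field_kib(meminfo, "Cached")
                            + meminfo_field_kib(meminfo, "SReclaimable");
    // only count half of reclaimable cache as usable (assumed discount)
    uint64_t kib = free_kib + recl_kib / 2;
    kib = kib > watermark_high_kib ? kib - watermark_high_kib : 0;
    uint64_t bytes = kib * 1024;
    if (cgroup_limit_bytes > 0) {
        bytes = std::min(bytes, cgroup_limit_bytes);
    }
    return bytes;
}
```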
Windows and macOS use best-effort host-memory availability APIs; tests have not yet been performed there.
Path impact evaluation (to be updated)
- Vulkan (gfx1151) reports `IGPU`; the single-iGPU UMA path is covered. Tested.
- ROCm/HIP (gfx1151) reports `IGPU`. Tested with a separate iGPU detection fix.*
- CUDA dGPUs report `GPU`, so the iGPU UMA path is not used. The host check is best-effort only.
- Multi-dGPU setups report `GPU`; the existing per-device fitting remains the main path.
- Intel iGPU on OpenCL reports `IGPU`. Smoke test for OpenCL in the logs.
- Setting `IGPU` devices explicitly via `params.devices` triggers the host-backed path by design. Tested only for selected backends.

\* ROCm/HIP iGPU testing used a separate iGPU detection fix that is not part of this PR.
Limitations
This PR does not try to solve every multi-device UMA case.
Known follow-up topics include multi-device setups that mix host-backed `IGPU` devices with discrete `GPU` devices.
CPU only (master), unsuccessful fit:
llama-fit-128k-times-10-cpu_master.log
CPU only with this PR and PR #22939 :
llama-fit-128k-times-10-cpu_v2.log
Fixing the CPU case here as well is intentionally within the scope of this PR (for now). This can change, but it also helps with testing the host-memory path on more systems and OSes.
AMD Strix Halo 128 GiB Vulkan with this PR and PR #22939:
llama-fit-128k-times-10_gfx1151_vulkan_v2.log
AMD Strix Halo 128 GiB ROCm/HIP with this PR and PR #22939 (with a separate iGPU detection fix #23007):
llama-fit-128k-times-10_gfx1151_rocm_v2.log
Nvidia RTX 5090 32 GB CUDA:
5090_single_dGPU_load.log
Nvidia RTX 4090 24 GB + 5090 32 GB CUDA:
4090_5090_2dGPU_load.log
Intel iGPU OpenCL with `IGPU` set explicitly by user:
Intel_iGPU_openCL_smoke_test.log
Additional information
Instead, the Linux path uses a more precise host budget estimate that avoids treating all reclaimable memory as equally available and subtracts kernel high watermarks. This addresses the observed Linux failure mode without applying a broad artificial margin. Hence these special memory-detection code sections were added.
Regarding observed system stability during testing on Linux Ubuntu 26.04 (current master):
Windows and macOS remain best-effort because their memory pressure APIs expose different information.
Requirements
YES, AI was used to implement my core ideas presented above in a coherent form. I'm taking responsibility and I'm prepared to discuss changes and revise the implementation accordingly.