
common: improve --fit host-memory accounting for CPU and iGPU #22922

Open

fl0rianr wants to merge 3 commits into ggml-org:master from fl0rianr:fix/fit_host_memory_CPU_iGPU


Conversation

@fl0rianr (Contributor) commented May 10, 2026

Overview

This PR improves --fit by adding host-backed memory accounting for cases where host memory is the limiting budget.

The main affected paths are:

  • CPU-only inference
  • UMA / integrated GPU inference when the backend reports GGML_BACKEND_DEVICE_TYPE_IGPU

The change keeps the existing pre-load fitting logic, but records a context-memory profile and performs an optional post-load context adjustment once model weights have been loaded and the actual remaining host budget is known.

Problem

On UMA systems, GPU memory and host memory are backed by the same physical RAM.

For example, a system with 128 GiB of RAM may allow the iGPU to use 64 GiB, 96 GiB, or even more.
For the testing in this PR, 120 GB was assigned to the gfx1151 iGPU.
If the OS and other processes already use a large part of system memory, loading a large model can still trigger hard OOM kills even though the device-reported memory budget appears sufficient.

The current fit implementation cannot prevent these OOM events because host RAM and device VRAM are budgeted independently
and shared memory is not taken into account. For integrated GPUs this overestimates the amount of memory that is physically available: in the 128 GiB example above, counting the 120 GB iGPU budget and the host budget separately adds up to far more memory than the 128 GiB that actually exists.

More generally, the host memory budget is not accounted for in a fully correct way.

Approach

This PR adds a post-load host-memory pressure pass:

  • CPU and iGPU memory budgets are treated as one shared budget rather than two separate ones
  • during normal --fit, record a context-memory profile when the limiting path uses host-backed memory
  • load model weights first
  • re-check the actual remaining host/device budget
  • reduce n_ctx before llama_init_from_model() if needed

The pass is intentionally limited to host-backed paths:

  • CPU-only
  • single GGML_BACKEND_DEVICE_TYPE_IGPU

For non-iGPU (discrete GPU) paths, the existing per-device fitting logic, which already works correctly, remains the main behavior.
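
A minimal sketch of the post-load adjustment, assuming a recorded context-memory profile and a post-load host-budget estimate; the names used here (ctx_mem_profile, adjust_n_ctx_post_load) are illustrative only, not the PR's actual symbols:

```cpp
// Hypothetical sketch of the post-load context adjustment described above.
#include <cstddef>
#include <cstdint>

struct ctx_mem_profile {
    size_t bytes_fixed;         // host-backed context cost that does not scale with n_ctx
    size_t bytes_per_ctx_token; // host-backed KV/compute cost per context token
};

static uint32_t adjust_n_ctx_post_load(uint32_t n_ctx, const ctx_mem_profile & prof,
                                       size_t host_budget_after_weights) {
    // Host-backed context cost for the n_ctx chosen by the pre-load fit pass.
    const size_t need = prof.bytes_fixed + (size_t) n_ctx * prof.bytes_per_ctx_token;
    if (need <= host_budget_after_weights) {
        return n_ctx; // still fits, keep the pre-load result
    }
    if (host_budget_after_weights <= prof.bytes_fixed || prof.bytes_per_ctx_token == 0) {
        return 0; // not even the fixed context cost fits, caller should abort
    }
    // Shrink n_ctx so the context fits into what actually remains after the
    // model weights have been loaded.
    return (uint32_t) ((host_budget_after_weights - prof.bytes_fixed) / prof.bytes_per_ctx_token);
}
```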

OS branching

On Linux, the host budget is estimated from memory that can be used without relying on swap:

  • MemFree
  • weighted Inactive(file) based on vm.swappiness
  • SReclaimable
  • minus zone high watermarks

The Linux path also considers cgroup memory limits, so Docker / container cases are handled.
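
As a rough illustration of this estimate (the keys are real /proc/meminfo fields, but the swappiness weighting and the watermark handling shown here are simplified assumptions, not the PR's code):

```cpp
// Simplified illustration of the Linux host-budget idea; values are in KiB.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

static int64_t meminfo_kib(const std::string & key) {
    std::ifstream f("/proc/meminfo");
    std::string line;
    while (std::getline(f, line)) {
        if (line.rfind(key + ":", 0) == 0) {
            int64_t kib = 0;
            std::istringstream(line.substr(key.size() + 1)) >> kib;
            return kib;
        }
    }
    return 0;
}

// swappiness: value of vm.swappiness; high_wmark_kib: sum of zone high
// watermarks from /proc/zoneinfo (collection not shown here).
static int64_t host_budget_no_swap_kib(int64_t swappiness, int64_t high_wmark_kib) {
    const int64_t mem_free   = meminfo_kib("MemFree");
    const int64_t inact_file = meminfo_kib("Inactive(file)");
    const int64_t sreclaim   = meminfo_kib("SReclaimable");
    // The lower vm.swappiness is, the more the kernel prefers dropping file
    // cache over swapping, so more of Inactive(file) counts as usable without
    // swap. The linear weighting is an assumption for illustration.
    const int64_t weighted_inact = inact_file * (100 - std::min<int64_t>(swappiness, 100)) / 100;
    const int64_t budget = mem_free + weighted_inact + sreclaim - high_wmark_kib;
    return budget > 0 ? budget : 0;
}
```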

Windows and macOS use best-effort host-memory availability APIs; these paths have not been tested yet.

Path impact evaluation (to be updated)

| Path | Risk | Evaluation |
| --- | --- | --- |
| CPU-only | low | Covered by the post-load host-memory pass. Tested. |
| AMD iGPU Vulkan | low | Vulkan reports IGPU; single-iGPU UMA path is covered. Tested. |
| ROCm / HIP AMD iGPU | low* | Covered when the backend reports IGPU. Tested with separate iGPU detection fix. |
| Metal / Apple UMA | low | Metal currently reports GPU, so the iGPU UMA path is not used. Host check is best-effort only. |
| CUDA / HIP / Vulkan dGPU | low | Reports GPU; existing per-device fitting remains the main path. |
| CUDA multi-dGPU | low | Tested with RTX 4090 + RTX 5090. Existing per-device fitting path remains active. |
| OpenCL / SYCL / OpenVINO / CANN / WebGPU / RPC | low | Expected unaffected unless a backend reports IGPU. Smoke test for OpenCL in logs. |
| Explicit single iGPU via params.devices | low / medium | IGPU triggers the host-backed path by design. Tested only for selected backends. |
| Multiple iGPUs | medium | Not covered yet. Follow-up needed. |
| Explicit dGPU + iGPU | medium | Not covered by the post-load host-backed pass. Follow-up needed. |
| Default dGPU + iGPU | low | llama.cpp only adds iGPUs by default when no dGPU is used. |
| LLAMA_SPLIT_MODE_TENSOR / Meta device | n/a | --fit rejects this mode early. |
| BLAS / zDNN / ZenDNN / CPU accelerators | low | CPU/accelerator style path; not an iGPU UMA path. |

* ROCm/HIP iGPU testing used a separate iGPU detection fix that is not part of this PR.

Limitations

This PR does not try to solve every multi-device UMA case.

Known follow-up topics:

  • multiple iGPUs
  • explicit mixed dGPU + iGPU setups
  • backends that have UMA hardware but currently report GPU
  • improving best-effort host budget estimation on Windows and macOS

Logs / testing

CPU only (master), unsuccessful fit:
llama-fit-128k-times-10-cpu_master.log

CPU only with this PR and PR #22939 :
llama-fit-128k-times-10-cpu_v2.log

Fixing the CPU case here as well is intentionally kept in the scope of this PR (for now).
This can change, but it also helps with testing the host-memory path on more systems and OSes.

AMD Strix Halo 128 GiB Vulkan this PR and PR #22939:
llama-fit-128k-times-10_gfx1151_vulkan_v2.log

AMD Strix Halo 128 GiB ROCm/HIP this PR and PR #22939 (with a separate iGPU detection fix, #23007):
llama-fit-128k-times-10_gfx1151_rocm_v2.log

Nvidia RTX 5090 32 GB CUDA:
5090_single_dGPU_load.log

Nvidia RTX 4090 24 GB + 5090 32 GB CUDA:
4090_5090_2dGPU_load.log

Intel iGPU OpenCL with IGPU set explicitly by the user:
Intel_iGPU_openCL_smoke_test.log

Additional information

The Linux path instead uses a more precise host budget estimate that avoids treating all reclaimable memory as equally available and subtracts kernel high watermarks. This addresses the observed Linux failure mode without applying a broad artificial margin, which is why these dedicated memory-detection code sections were added.

Regarding observed system stability during testing on Linux Ubuntu 26.04 (current master):

  • CPU-only: OOM kill --> unsuccessful model load / OOM kill in time
  • iGPU-only load (host clean): OOM kill --> much faster allocation / OOM kill in time
  • iGPU-only load (browser active): OOM kill --> the browser being OOM-killed because the load is too fast, the llama OOM kill coming too late, or a session restart after e.g. 30 s are all potential outcomes

Windows and macOS remain best-effort because their memory pressure APIs expose different information.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, AI was used to implement my core ideas presented above in a coherent form. I take responsibility and I am prepared to discuss changes and revise the implementation accordingly.

Add a second post-load refinement pass for --fit on CPU-only and
single-device UMA/iGPU paths.

The initial fit pass runs before the real model load and can
overestimate the remaining host-memory budget once model weights,
context buffers, compute buffers, and page cache compete for the same
physical memory.

Record a small context-memory profile during the existing no_alloc fit
probe, then after model load re-check the shared host-memory budget and
reduce n_ctx when required.

Linux uses a host-pressure estimate based on free memory, reclaimable
file/slab cache, and zone high watermarks, with optional cgroup v1/v2
limit clamping when available.

Tested on Linux.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr fl0rianr requested review from a team and JohannesGaessler as code owners May 10, 2026 21:04
@fl0rianr fl0rianr marked this pull request as draft May 10, 2026 21:04
@fl0rianr (Contributor, Author)

Hi @JohannesGaessler, I opened this as a draft because the code is functional in my tested paths, but the PR touches --fit behavior in a significant way.

Linux is tested more thoroughly, while Windows/macOS are best-effort for now. Additional testing from other users in the community would be very helpful.

So I think discussion on scope and approach is appropriate before marking it ready for review or merge.

@JohannesGaessler (Contributor)

If I understand the intent behind this PR correctly, it is the wrong approach. As of right now there is no proper implementation of "free" host memory in ggml_backend_cpu_device_get_memory. That is where free host memory should be calculated; the changes to fit.cpp should then use that value and be minimal.

@fl0rianr (Contributor, Author)

Alright, I can separate the changes; those regarding host memory should go into ggml_backend_cpu_device_get_memory. Thanks.

Move the host free-memory calculation into ggml_backend_cpu_device_get_memory() so common/fit.cpp only consumes backend-reported memory values.

Keep the tested --fit post-load context adjustment for CPU-only and single-iGPU host-backed paths, while making the fit changes smaller and easier to review.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label May 10, 2026
@fl0rianr (Contributor, Author)

Done. I suppose some hard discussions are ahead... It's a very crucial topic from my point of view. Thanks again.

@JohannesGaessler (Contributor)

Maybe I should have been more clear about this: the changes to the CPU backend should be their own PR since they can be reviewed and merged separately. In particular, I am not an expert regarding how to best estimate free host memory.

@fl0rianr (Contributor, Author)

True, that is the correct approach; I was very much focused on fixing --fit. I opened the CPU backend memory change as a separate PR: #22939.

I will remove the ggml CPU memory changes from this PR in the next commit, so this PR stays focused purely on --fit.

Before potentially simplifying it further, I would like to understand the preferred direction for fit.

My current local attempt was to avoid running fit twice, once for weights and once for context. That has not worked cleanly so far.

In light of that, the two-step approach may actually be the better direction for the future: let the load path expose separate weight/context memory estimates to the caller, so --fit can make more precise override decisions before llama_init_from_model().

That could also help potential follow-up work where external callers load weights and context more explicitly/separately, similar to what llama.cpp already does internally.
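
To make that concrete, here is a purely hypothetical sketch of such an estimate; none of these symbols exist in llama.cpp today, it only illustrates the split I mean:

```cpp
// Purely hypothetical shape for separate weight/context estimates.
#include <cstddef>

struct llama_fit_estimate {
    size_t weights_bytes_host;   // host-backed portion of the model weights
    size_t weights_bytes_device; // device-backed portion of the model weights
    size_t ctx_bytes_host;       // host-backed KV/compute cost for the requested n_ctx
    size_t ctx_bytes_device;     // device-backed KV/compute cost for the requested n_ctx
};

// The no_alloc probe would fill such a struct; --fit could then compare it
// against the backend-reported free memory and decide on n_ctx overrides
// before calling llama_init_from_model().
```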

I am happy to change direction. I just want to avoid polishing the wrong approach.

@JohannesGaessler (Contributor)

If there is a proper estimate for free host memory, simply use that to determine the context size reduction (with consideration for not double-counting unified memory). After that basically no changes should be needed as the algorithm will try to fill the GPU up first and push everything else to host memory.

Account for non-mmap single-iGPU shared host/device memory pressure
during fit estimation without relying on model post-load information.

Improve CPU-only fit behavior by including CPU repack buffers
in the host memory breakdown. Changes reduced to a minimum.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr (Contributor, Author) commented May 12, 2026

Reduced the changes to a minimum as requested. The CPU fix is improved, so full memory can be utilized while no swapping occurs.
Load_CPU
dGPU tested and still fully valid; no new logs uploaded for this case since the logic remains the same. CPU, iGPU Vulkan, and iGPU ROCm logs have been updated.

@fl0rianr fl0rianr marked this pull request as ready for review May 12, 2026 22:18
