
common: improve --fit host-memory accounting for CPU and iGPU #22922

Open

fl0rianr wants to merge 3 commits into ggml-org:master from fl0rianr:fix/fit_host_memory_CPU_iGPU


Conversation

@fl0rianr (Contributor) commented May 10, 2026

Overview

This PR improves --fit by adding host-backed memory accounting for cases where host memory is the limiting budget.

The main affected paths are:

  • CPU-only inference
  • UMA / integrated GPU inference when the backend reports GGML_BACKEND_DEVICE_TYPE_IGPU

The change keeps the existing pre-load fitting logic, but records a context-memory profile and performs an optional post-load context adjustment once model weights have been loaded and the actual remaining host budget is known.

Problem

On UMA systems, GPU memory and host memory are backed by the same physical RAM.

For example, a system with 128 GiB of RAM may allow the iGPU to use 64 GiB, 96 GiB, or even more.
For the testing in this PR, 120 GB was assigned to the gfx1151 iGPU.
If the OS and other processes already use a large part of system memory, loading a large model can still trigger hard OOM kills even though the device-reported memory budget appears sufficient.

The current fit implementation cannot prevent these OOM events because host RAM and device VRAM are budgeted independently
and shared memory is not taken into account. For integrated GPUs this overestimates the amount of memory that is physically available: in the 128 GiB example above, counting the 120 GB iGPU budget and the host budget separately adds up to far more memory than the 128 GiB that actually exists.

More generally, the host memory budget is not accounted for in a fully correct way.

Approach

This PR adds a post-load host-memory pressure pass:

  • CPU and iGPU memory budgets are treated as one shared budget rather than two separate ones
  • during normal --fit, record a context-memory profile when the limiting path uses host-backed memory
  • load model weights first
  • re-check the actual remaining host/device budget
  • reduce n_ctx before llama_init_from_model() if needed

The pass is intentionally limited to host-backed paths:

  • CPU-only
  • single GGML_BACKEND_DEVICE_TYPE_IGPU

For non-iGPU (discrete GPU) paths, the existing per-device fitting logic, which already works correctly, remains the main behavior.
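
A minimal sketch of the post-load adjustment, assuming a recorded context-memory profile and a post-load host-budget estimate; the names used here (ctx_mem_profile, adjust_n_ctx_post_load) are illustrative only, not the PR's actual symbols:

```cpp
// Hypothetical sketch of the post-load context adjustment described above.
#include <cstddef>
#include <cstdint>

struct ctx_mem_profile {
    size_t bytes_fixed;         // host-backed context cost that does not scale with n_ctx
    size_t bytes_per_ctx_token; // host-backed KV/compute cost per context token
};

static uint32_t adjust_n_ctx_post_load(uint32_t n_ctx, const ctx_mem_profile & prof,
                                       size_t host_budget_after_weights) {
    // Host-backed context cost for the n_ctx chosen by the pre-load fit pass.
    const size_t need = prof.bytes_fixed + (size_t) n_ctx * prof.bytes_per_ctx_token;
    if (need <= host_budget_after_weights) {
        return n_ctx; // still fits, keep the pre-load result
    }
    if (host_budget_after_weights <= prof.bytes_fixed || prof.bytes_per_ctx_token == 0) {
        return 0; // not even the fixed context cost fits, caller should abort
    }
    // Shrink n_ctx so the context fits into what actually remains after the
    // model weights have been loaded.
    return (uint32_t) ((host_budget_after_weights - prof.bytes_fixed) / prof.bytes_per_ctx_token);
}
```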

OS branching

On Linux, the host budget is estimated from memory that can be used without relying on swap:

  • MemFree
  • weighted Inactive(file) based on vm.swappiness
  • SReclaimable
  • minus zone high watermarks

The Linux path also considers cgroup memory limits, so Docker / container cases are handled.
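
As a rough illustration of this estimate (the keys are real /proc/meminfo fields, but the swappiness weighting and the watermark handling shown here are simplified assumptions, not the PR's code):

```cpp
// Simplified illustration of the Linux host-budget idea; values are in KiB.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

static int64_t meminfo_kib(const std::string & key) {
    std::ifstream f("/proc/meminfo");
    std::string line;
    while (std::getline(f, line)) {
        if (line.rfind(key + ":", 0) == 0) {
            int64_t kib = 0;
            std::istringstream(line.substr(key.size() + 1)) >> kib;
            return kib;
        }
    }
    return 0;
}

// swappiness: value of vm.swappiness; high_wmark_kib: sum of zone high
// watermarks from /proc/zoneinfo (collection not shown here).
static int64_t host_budget_no_swap_kib(int64_t swappiness, int64_t high_wmark_kib) {
    const int64_t mem_free   = meminfo_kib("MemFree");
    const int64_t inact_file = meminfo_kib("Inactive(file)");
    const int64_t sreclaim   = meminfo_kib("SReclaimable");
    // The lower vm.swappiness is, the more the kernel prefers dropping file
    // cache over swapping, so more of Inactive(file) counts as usable without
    // swap. The linear weighting is an assumption for illustration.
    const int64_t weighted_inact = inact_file * (100 - std::min<int64_t>(swappiness, 100)) / 100;
    const int64_t budget = mem_free + weighted_inact + sreclaim - high_wmark_kib;
    return budget > 0 ? budget : 0;
}
```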

Windows and macOS use best-effort host-memory availability APIs; these paths have not been tested yet.

Path impact evaluation (to be updated)

| Path | Risk | Evaluation |
| --- | --- | --- |
| CPU-only | low | Covered by the post-load host-memory pass. Tested. |
| AMD iGPU Vulkan | low | Vulkan reports IGPU; single-iGPU UMA path is covered. Tested. |
| ROCm / HIP AMD iGPU | low* | Covered when the backend reports IGPU. Tested with separate iGPU detection fix. |
| Metal / Apple UMA | low | Metal currently reports GPU, so the iGPU UMA path is not used. Host check is best-effort only. |
| CUDA / HIP / Vulkan dGPU | low | Reports GPU; existing per-device fitting remains the main path. |
| CUDA multi-dGPU | low | Tested with RTX 4090 + RTX 5090. Existing per-device fitting path remains active. |
| OpenCL / SYCL / OpenVINO / CANN / WebGPU / RPC | low | Expected unaffected unless a backend reports IGPU. Smoke test for OpenCL in logs. |
| Explicit single iGPU via params.devices | low / medium | IGPU triggers the host-backed path by design. Tested only for selected backends. |
| Multiple iGPUs | medium | Not covered yet. Follow-up needed. |
| Explicit dGPU + iGPU | medium | Not covered by the post-load host-backed pass. Follow-up needed. |
| Default dGPU + iGPU | low | llama.cpp only adds iGPUs by default when no dGPU is used. |
| LLAMA_SPLIT_MODE_TENSOR / Meta device | n/a | --fit rejects this mode early. |
| BLAS / zDNN / ZenDNN / CPU accelerators | low | CPU/accelerator style path; not an iGPU UMA path. |

* ROCm/HIP iGPU testing used a separate iGPU detection fix that is not part of this PR.

Limitations

This PR does not try to solve every multi-device UMA case.

Known follow-up topics:

  • multiple iGPUs
  • explicit mixed dGPU + iGPU setups
  • backends that have UMA hardware but currently report GPU
  • improving best-effort host budget estimation on Windows and macOS

Logs / testing

CPU only (master), unsuccessful fit:
llama-fit-128k-times-10-cpu_master.log

CPU only with this PR and PR #22939 :
llama-fit-128k-times-10-cpu_v2.log

Fixing the CPU case here as well is intentionally kept in the scope of this PR (for now).
This can change, but it also helps with testing the host-memory path on more systems and OSes.

AMD Strix Halo 128 GiB Vulkan this PR and PR #22939:
llama-fit-128k-times-10_gfx1151_vulkan_v2.log

AMD Strix Halo 128 GiB ROCm/HIP this PR and PR #22939 (with a separate iGPU detection fix, #23007):
llama-fit-128k-times-10_gfx1151_rocm_v2.log

Nvidia RTX 5090 32 GB CUDA:
5090_single_dGPU_load.log

Nvidia RTX 4090 24 GB + 5090 32 GB CUDA:
4090_5090_2dGPU_load.log

Intel iGPU OpenCL with IGPU set explicitly by the user:
Intel_iGPU_openCL_smoke_test.log

Additional information

The Linux path instead uses a more precise host budget estimate that avoids treating all reclaimable memory as equally available and subtracts kernel high watermarks. This addresses the observed Linux failure mode without applying a broad artificial margin, which is why these dedicated memory-detection code sections were added.

Regarding observed system stability during testing on Linux Ubuntu 26.04 (current master):

  • CPU-only: OOM kill --> unsuccessful model load / OOM kill in time
  • iGPU-only load (host clean): OOM kill --> much faster allocation / OOM kill in time
  • iGPU-only load (browser active): OOM kill --> the browser being OOM-killed because the load is too fast, the llama OOM kill coming too late, or a session restart after e.g. 30 s are all potential outcomes

Windows and macOS remain best-effort because their memory pressure APIs expose different information.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, AI was used to implement my core ideas presented above in a coherent form. I take responsibility and I am prepared to discuss changes and revise the implementation accordingly.

Add a second post-load refinement pass for --fit on CPU-only and
single-device UMA/iGPU paths.

The initial fit pass runs before the real model load and can
overestimate the remaining host-memory budget once model weights,
context buffers, compute buffers, and page cache compete for the same
physical memory.

Record a small context-memory profile during the existing no_alloc fit
probe, then after model load re-check the shared host-memory budget and
reduce n_ctx when required.

Linux uses a host-pressure estimate based on free memory, reclaimable
file/slab cache, and zone high watermarks, with optional cgroup v1/v2
limit clamping when available.

Tested on Linux.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr fl0rianr requested review from a team and JohannesGaessler as code owners May 10, 2026 21:04
@fl0rianr fl0rianr marked this pull request as draft May 10, 2026 21:04
@fl0rianr (Contributor, Author)

Hi @JohannesGaessler, I opened this as a draft because the code is functional in my tested paths, but the PR touches --fit behavior in a significant way.

Linux is tested more thoroughly, while Windows/macOS are best-effort for now. Additional testing from other users in the community would be very helpful.

So I think discussion on scope and approach is appropriate before marking it ready for review or merge.

@JohannesGaessler (Contributor)

If I understand the intent behind this PR correctly, it is the wrong approach. As of right now there is no proper implementation of "free" host memory in ggml_backend_cpu_device_get_memory. That is where free host memory should be calculated; the changes to fit.cpp should then use that value and be minimal.

@fl0rianr (Contributor, Author)

Alright, I can separate the changes; those regarding host memory should go into ggml_backend_cpu_device_get_memory. Thanks.

Move the host free-memory calculation into ggml_backend_cpu_device_get_memory() so common/fit.cpp only consumes backend-reported memory values.

Keep the tested --fit post-load context adjustment for CPU-only and single-iGPU host-backed paths, while making the fit changes smaller and easier to review.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label May 10, 2026
@fl0rianr (Contributor, Author)

Done. I suppose some hard discussions are ahead... It's a very crucial topic from my point of view. Thanks again.

@JohannesGaessler (Contributor)

Maybe I should have been more clear about this: the changes to the CPU backend should be their own PR since they can be reviewed and merged separately. In particular, I am not an expert regarding how to best estimate free host memory.

@fl0rianr (Contributor, Author)

True, that is the correct approach; I was very much focused on fixing --fit. I opened the CPU backend memory change as a separate PR: #22939.

I will remove the ggml CPU memory changes from this PR in the next commit, so this PR stays focused purely on --fit.

Before potentially simplifying it further, I would like to understand the preferred direction for fit.

My current local attempt was to avoid running fit twice, once for weights and once for context. That has not worked cleanly so far.

In light of that, the two-step approach may actually be the better direction for the future: let the load path expose separate weight/context memory estimates to the caller, so --fit can make more precise override decisions before llama_init_from_model().

That could also help potential follow-up work where external callers load weights and context more explicitly/separately, similar to what llama.cpp already does internally.
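
To make that concrete, here is a purely hypothetical sketch of such an estimate; none of these symbols exist in llama.cpp today, it only illustrates the split I mean:

```cpp
// Purely hypothetical shape for separate weight/context estimates.
#include <cstddef>

struct llama_fit_estimate {
    size_t weights_bytes_host;   // host-backed portion of the model weights
    size_t weights_bytes_device; // device-backed portion of the model weights
    size_t ctx_bytes_host;       // host-backed KV/compute cost for the requested n_ctx
    size_t ctx_bytes_device;     // device-backed KV/compute cost for the requested n_ctx
};

// The no_alloc probe would fill such a struct; --fit could then compare it
// against the backend-reported free memory and decide on n_ctx overrides
// before calling llama_init_from_model().
```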

I am happy to change direction. I just want to avoid polishing the wrong approach.

@JohannesGaessler (Contributor)

If there is a proper estimate for free host memory, simply use that to determine the context size reduction (with consideration for not double-counting unified memory). After that basically no changes should be needed as the algorithm will try to fill the GPU up first and push everything else to host memory.

Account for non-mmap single-iGPU shared host/device memory pressure
during fit estimation without relying on model post-load information.

Improve CPU-only fit behavior by including CPU repack buffers
in the host memory breakdown. Changes reduced to a minimum.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr (Contributor, Author) commented May 12, 2026

Reduced the changes to a minimum as requested. The CPU fix is improved, so full memory can be utilized while no swapping occurs.
Load_CPU
dGPU tested and still fully valid; no new logs uploaded for this case since the logic remains the same. CPU, iGPU Vulkan, and iGPU ROCm logs have been updated.

@fl0rianr fl0rianr marked this pull request as ready for review May 12, 2026 22:18
