
ggml-cpu: avoid treating all host RAM as free#22939

Open
fl0rianr wants to merge 1 commit into ggml-org:master from fl0rianr:fix/cpu_host_memory_estimation

Conversation

@fl0rianr
Contributor

Overview

This PR makes the CPU backend use a more conservative host memory estimate on Linux, instead of treating total RAM as free memory.

It is linked to PR #22922, which improves common/fit.cpp.

Approach

On Linux, the host budget is estimated from memory that can be used without relying on swap:

  • MemFree
  • Inactive(file), weighted by vm.swappiness
  • SReclaimable
  • minus the zone high watermarks
  • capped by visible cgroup memory limits
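The arithmetic behind the bullets above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names (`linux_meminfo`, `estimate_host_budget`) and the linear swappiness weighting are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative snapshot of the /proc/meminfo, zone watermark, and cgroup
// values the estimate draws on. All fields are in bytes.
struct linux_meminfo {
    uint64_t mem_free;        // MemFree
    uint64_t inactive_file;   // Inactive(file)
    uint64_t s_reclaimable;   // SReclaimable
    uint64_t zone_high_wmark; // sum of per-zone high watermarks
    uint64_t cgroup_limit;    // visible cgroup memory limit (UINT64_MAX if none)
    uint64_t cgroup_usage;    // current cgroup memory usage
};

static uint64_t estimate_host_budget(const linux_meminfo & mi, int swappiness) {
    // Assumed weighting: count less of Inactive(file) as swappiness rises,
    // since vm.swappiness is the kernel's bias between reclaiming file pages
    // and swapping anonymous memory (valid range 0..200).
    const double file_weight = 1.0 - std::min(swappiness, 200) / 200.0;

    uint64_t budget = mi.mem_free
                    + (uint64_t)(mi.inactive_file * file_weight)
                    + mi.s_reclaimable;

    // Keep the zone high watermarks in reserve so the selected context does
    // not immediately push the kernel into reclaim.
    budget -= std::min(budget, mi.zone_high_wmark);

    // A visible cgroup limit caps the budget at the remaining headroom.
    if (mi.cgroup_limit != UINT64_MAX) {
        const uint64_t headroom =
            mi.cgroup_limit - std::min(mi.cgroup_limit, mi.cgroup_usage);
        budget = std::min(budget, headroom);
    }
    return budget;
}
```

With a hypothetical swappiness of 100, half of Inactive(file) is counted, and a tight cgroup limit overrides the meminfo-derived value entirely.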

Windows already uses the OS available-physical-memory value and was smoke-tested.
macOS uses a best-effort host memory API path, but I could not test it yet (no device available).

Logs / testing

Tested CPU-only --fit on Linux and Windows.

CPU-only Linux (master), unsuccessful fit:
llama-fit-128k-times-10-cpu_master.log

CPU-only Linux with this PR and PR #22922:
llama-fit-128k-times-10-cpu.log

The Linux logs are taken from PR #22922, since they are still valid.

CPU-only Windows (master):
windows_fit_load_16GB_host_full_swap_pressue_master.log

CPU-only Windows with this PR and PR #22922:
Windows_16GB_system_with_fit_PR22922_modell_load.log

In the Windows 16 GB test (roughly 12 GB already claimed), the old path selected a much larger context and KV cache.
With this PR, the post-load host-pressure check kept a smaller context based on the available host budget.

Limitations

Users who intentionally rely on swap to fit a larger context may now get a smaller automatically fitted context.
Explicit user settings are still available when that behavior is wanted.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, AI was used to help turn my implementation idea into a cleaner patch.
    I reviewed the result and take responsibility for the code and follow-up changes.

Estimate a conservative Linux host memory budget for CPU --fit instead
of reporting total physical memory as available. The estimate uses
MemFree, reclaimable file/cache memory, zone watermarks, and visible
cgroup limits to avoid selecting contexts that rely heavily on swap or
memory reclaim.

Keep the Windows path based on the OS available-memory value, and use a
minimal best-effort macOS availability path instead of the old total-RAM
fallback.
@fl0rianr fl0rianr requested a review from ggerganov as a code owner May 11, 2026 09:45
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label May 11, 2026
@Geramy

Geramy commented May 11, 2026

I can test this on my Mac for this PR.


Labels

ggml changes relating to the ggml tensor library for machine learning
