
ggml-cpu: avoid treating all host RAM as free#22939

Open
fl0rianr wants to merge 1 commit into ggml-org:master from fl0rianr:fix/cpu_host_memory_estimation

Conversation

@fl0rianr
Contributor

Overview

This PR makes the CPU backend use a more conservative host memory estimate on Linux, instead of treating total RAM as free memory.

It is linked to PR #22922, which improves common/fit.cpp.

Approach

On Linux, the host budget is estimated from memory that can be used without relying on swap:

  • MemFree
  • Inactive(file), weighted by vm.swappiness
  • SReclaimable
  • minus the zone high watermarks
  • capped by visible cgroup memory limits
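The arithmetic behind the bullets above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names (`linux_meminfo`, `estimate_host_budget`) and the linear swappiness weighting are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative snapshot of the /proc/meminfo, zone watermark, and cgroup
// values the estimate draws on. All fields are in bytes.
struct linux_meminfo {
    uint64_t mem_free;        // MemFree
    uint64_t inactive_file;   // Inactive(file)
    uint64_t s_reclaimable;   // SReclaimable
    uint64_t zone_high_wmark; // sum of per-zone high watermarks
    uint64_t cgroup_limit;    // visible cgroup memory limit (UINT64_MAX if none)
    uint64_t cgroup_usage;    // current cgroup memory usage
};

static uint64_t estimate_host_budget(const linux_meminfo & mi, int swappiness) {
    // Assumed weighting: count less of Inactive(file) as swappiness rises,
    // since vm.swappiness is the kernel's bias between reclaiming file pages
    // and swapping anonymous memory (valid range 0..200).
    const double file_weight = 1.0 - std::min(swappiness, 200) / 200.0;

    uint64_t budget = mi.mem_free
                    + (uint64_t)(mi.inactive_file * file_weight)
                    + mi.s_reclaimable;

    // Keep the zone high watermarks in reserve so the selected context does
    // not immediately push the kernel into reclaim.
    budget -= std::min(budget, mi.zone_high_wmark);

    // A visible cgroup limit caps the budget at the remaining headroom.
    if (mi.cgroup_limit != UINT64_MAX) {
        const uint64_t headroom =
            mi.cgroup_limit - std::min(mi.cgroup_limit, mi.cgroup_usage);
        budget = std::min(budget, headroom);
    }
    return budget;
}
```

With a hypothetical swappiness of 100, half of Inactive(file) is counted, and a tight cgroup limit overrides the meminfo-derived value entirely.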

Windows already uses the OS available-physical-memory value and was smoke-tested.
macOS uses a best-effort host memory API path, but I could not test it yet (no device available).

Logs / testing

Tested CPU-only --fit on Linux and Windows.

CPU-only Linux (master), unsuccessful fit:
llama-fit-128k-times-10-cpu_master.log

CPU-only Linux with this PR and PR #22922:
llama-fit-128k-times-10-cpu.log

The Linux logs are taken from PR #22922, since they are still valid.

CPU-only Windows (master):
windows_fit_load_16GB_host_full_swap_pressue_master.log

CPU-only Windows with this PR and PR #22922:
Windows_16GB_system_with_fit_PR22922_modell_load.log

In the Windows 16 GB test (roughly 12 GB already claimed), the old path selected a much larger context and KV cache.
With this PR, the post-load host-pressure check kept a smaller context based on the available host budget.

Limitations

Users who intentionally rely on swap to fit a larger context may now get a smaller automatically fitted context.
Explicit user settings are still available when that behavior is wanted.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, AI was used to help turn my implementation idea into a cleaner patch.
    I reviewed the result and take responsibility for the code and follow-up changes.

Estimate a conservative Linux host memory budget for CPU --fit instead
of reporting total physical memory as available. The estimate uses
MemFree, reclaimable file/cache memory, zone watermarks, and visible
cgroup limits to avoid selecting contexts that rely heavily on swap or
memory reclaim.

Keep the Windows path based on the OS available-memory value, and use a
minimal best-effort macOS availability path instead of the old total-RAM
fallback.
@fl0rianr fl0rianr requested a review from ggerganov as a code owner May 11, 2026 09:45
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label May 11, 2026
@Geramy

Geramy commented May 11, 2026

I can test this on my Mac for this PR.


Labels

ggml changes relating to the ggml tensor library for machine learning
