ggml-cpu: avoid treating all host RAM as free #22939
Open
fl0rianr wants to merge 1 commit into
Estimate a conservative Linux host memory budget for CPU --fit instead of reporting total physical memory as available. The estimate uses MemFree, reclaimable file/cache memory, zone watermarks, and visible cgroup limits to avoid selecting contexts that rely heavily on swap or memory reclaim. Keep the Windows path based on the OS available-memory value, and use a minimal best-effort macOS availability path instead of the old total-RAM fallback.
I can test this on my Mac for this PR.
Overview
This PR makes the CPU backend use a more conservative host memory estimate on Linux instead of treating total RAM as free memory.
It is linked to PR #22922, which improves common/fit.cpp.
Approach
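On Linux, the host budget is estimated from memory that can be used without relying on swap. The estimate combines MemFree, reclaimable file/cache memory, zone watermarks, and visible cgroup limits.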
Windows already uses the OS available-physical-memory value and was smoke-tested.
macOS uses a best-effort host memory API path, but I could not test it yet (no device available).
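The Linux estimate can be illustrated with a minimal sketch. It is not the code from this PR: the struct `host_mem_sample`, the helpers `meminfo_field_bytes` and `estimate_host_budget`, and the exact way the terms are combined are assumptions made for illustration. It only shows the general shape of "free plus reclaimable memory, minus the kernel's watermark reserve, capped by the cgroup headroom".

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical inputs, all in bytes. A real implementation would read them
// from /proc/meminfo, /proc/zoneinfo and the cgroup filesystem.
struct host_mem_sample {
    uint64_t mem_free;        // MemFree
    uint64_t reclaimable;     // file/cache memory assumed to be reclaimable
    uint64_t zone_watermark;  // memory the kernel keeps in reserve (zone watermarks)
    uint64_t cgroup_headroom; // cgroup limit minus usage, UINT64_MAX if unlimited
};

// Parse one "Key:   12345 kB" line from /proc/meminfo-style text.
// Returns the value in bytes, or 0 if the key is not present.
static uint64_t meminfo_field_bytes(const std::string & text, const std::string & key) {
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size() + 1, key + ":") == 0) {
            std::istringstream fields(line.substr(key.size() + 1));
            uint64_t kib = 0;
            fields >> kib;
            return kib * 1024;
        }
    }
    return 0;
}

// Assumed combination of the terms: usable memory is what lies above the
// watermark reserve, and it can never exceed the cgroup headroom.
static uint64_t estimate_host_budget(const host_mem_sample & s) {
    const uint64_t candidate = s.mem_free + s.reclaimable;
    const uint64_t above_watermark =
        candidate > s.zone_watermark ? candidate - s.zone_watermark : 0;
    return std::min(above_watermark, s.cgroup_headroom);
}
```

Clamping to zero instead of letting the subtraction wrap around keeps the estimate conservative when the host is already below its watermarks.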
Logs / testing
Tested CPU-only --fit on Linux and Windows.
CPU-only Linux (master), unsuccessful fit:
llama-fit-128k-times-10-cpu_master.log
CPU-only Linux with this PR and PR #22922:
llama-fit-128k-times-10-cpu.log
The Linux logs are taken from PR #22922, since they are still valid.
CPU-only Windows (master):
windows_fit_load_16GB_host_full_swap_pressue_master.log
CPU-only Windows with this PR and PR #22922:
Windows_16GB_system_with_fit_PR22922_modell_load.log
In the Windows 16 GB test (roughly 12 GB already claimed), the old path selected a much larger context and KV cache.
With this PR, the post-load host-pressure check kept a smaller context based on the available host budget.
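To make the effect of a smaller host budget on the fitted context concrete, here is a simplified sketch. It is not the logic in common/fit.cpp: the function `fit_context_to_budget` and the linear "KV bytes per token" cost model are assumptions used only to show why a lower budget yields a smaller context.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: returns the largest context (in tokens) whose KV cache
// fits into the host budget once the model weights are accounted for.
// Returns 0 if the weights alone do not fit.
static uint32_t fit_context_to_budget(uint64_t budget_bytes,
                                      uint64_t weights_bytes,
                                      uint64_t kv_bytes_per_token,
                                      uint32_t requested_ctx) {
    if (budget_bytes <= weights_bytes) {
        return 0;
    }
    if (kv_bytes_per_token == 0) {
        return requested_ctx; // nothing to budget for
    }
    const uint64_t max_tokens = (budget_bytes - weights_bytes) / kv_bytes_per_token;
    return static_cast<uint32_t>(std::min<uint64_t>(max_tokens, requested_ctx));
}
```

With a 128 KiB per-token KV cost and 3 GiB of weights, a reported budget of 16 GiB would allow 106496 tokens of a requested 131072, while a budget of 4 GiB would allow only 8192. The numbers are illustrative and not taken from the attached logs.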
Limitations
Users who intentionally rely on swap to fit a larger context may now get a smaller automatically fitted context.
Explicit user settings are still available when that behavior is wanted.
Requirements
YES, was used to help turn my implementation idea into a cleaner patch.
I reviewed the result and take responsibility for the code and follow-up changes.