Summary
mlxcel cannot currently tell a user, before (or at) model load, how much memory a given model will require for a given context length — so the first signal of an over-capacity model is an allocation failure mid-load. This epic adds an accurate, layered memory estimator (weights + KV cache + runtime/activation headroom), validates it against ground-truth MLX runtime memory, and surfaces it through a dedicated mlxcel inspect subcommand and an --estimate-memory preflight on generate/serve that aborts (with override) when a model won't fit.
Background / current state
Memory is three layers with different predictability:
| Layer | Driven by | Predictability |
| --- | --- | --- |
| Weights | config (hidden/layers/vocab) + quant bits, or safetensors header | static, exact |
| KV cache | layers × kv_heads × head_dim × dtype × context_len × batch | exact once context fixed |
| Activation / MLX pool | seq_len, graph intermediates, Metal buffer pool | dynamic, needs runtime measurement |
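
The three layers can be combined as in the minimal sketch below. It is not the epic's implementation: `KvShape`, `total_estimate_bytes`, and every field name are hypothetical. It assumes separate K and V tensors (hence a factor of 2 that the KV row above leaves implicit), a whole number of bytes per cache element, and a headroom modelled as a fraction of weights plus KV.

```rust
/// Hypothetical description of the KV cache shape; names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct KvShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    /// Bytes per cache element, e.g. 2 for fp16/bf16 (assumes whole bytes).
    pub bytes_per_element: u64,
    pub batch: u64,
}

impl KvShape {
    /// Bytes added to the cache for each token of context (K and V together).
    pub fn bytes_per_token(&self) -> u64 {
        2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_element * self.batch
    }

    /// Bytes held by the cache once `context_len` tokens are stored.
    pub fn bytes_at_context(&self, context_len: u64) -> u64 {
        self.bytes_per_token() * context_len
    }
}

/// Weights + KV, plus a fractional headroom for activations and the MLX pool.
/// `headroom_factor` is an assumed input here; the epic calibrates it from measurements.
pub fn total_estimate_bytes(weight_bytes: u64, kv_bytes: u64, headroom_factor: f64) -> u64 {
    let base = weight_bytes + kv_bytes;
    base + (base as f64 * headroom_factor).ceil() as u64
}
```

For example, 32 layers, 8 KV heads, a head dimension of 128, 2-byte elements, and batch 1 give 128 KiB per token, which grows to 1 GiB at a context of 8192 tokens.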
What already exists (and its gaps):
- estimate_model_params_billions() (src/execution/quant_advisor.rs:45): config → param count (B). Exposed via mlxcel generate --recommend-quant. Produces a count, not bytes.
- build_model_profile() + ModelProfile::total_param_bytes() (src/distributed/pipeline/partition_profile.rs:64, src/distributed/pipeline/partition.rs:152): config + quantization.bits → bytes, accurate to within a few percent. Only used in the distributed/pipeline path, not in single-device load / CLI.
- recommend_quantization() (src/lib/mlxcel-core/src/hardware.rs:426): already decides fit using ~2/1/0.5 bytes-per-param plus a flat KV_CACHE_HEADROOM_GB = 2 constant, compared against total unified memory.
- CachePool::memory_usage_bytes(): runtime KV usage (a measurement, not an estimate).
- Hardware unified-memory total via sysctlbyname("hw.memsize") (src/lib/mlxcel-core/src/hardware.rs:256).
Missing pieces this epic fills: byte-accurate weights from the safetensors header (the metadata.total_size in model.safetensors.index.json is parsed-but-discarded by parse_shard_index in src/lib/mlxcel-core/src/weights.rs), a real KV-cache estimator (the flat 2 GB constant is wrong for long context / many layers), MLX runtime-memory FFI (get_active_memory/get_peak_memory/set_memory_limit/set_cache_limit are not wrapped anywhere), and a unified estimator wired into the single-device CLI.
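
As a rough illustration of why weight bytes can be exact without loading any tensors: a safetensors file starts with an 8-byte little-endian header length, followed by the JSON header, followed by the raw tensor data. The sketch below derives the payload size from that prefix and the file length alone. The function name `tensor_payload_bytes` is hypothetical, and this is one possible approach rather than the one the epic prescribes (which also mentions metadata.total_size from the shard index).

```rust
use std::io::{self, Read};

/// Size in bytes of the tensor payload of one safetensors file.
///
/// Reads only the 8-byte little-endian header-length prefix from `reader`;
/// `file_len` is the total file size (e.g. from filesystem metadata).
pub fn tensor_payload_bytes<R: Read>(mut reader: R, file_len: u64) -> io::Result<u64> {
    let mut prefix = [0u8; 8];
    reader.read_exact(&mut prefix)?;
    let header_len = u64::from_le_bytes(prefix);

    // payload = file size - 8-byte prefix - JSON header
    file_len
        .checked_sub(8)
        .and_then(|rest| rest.checked_sub(header_len))
        .ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidData, "header length exceeds file size")
        })
}
```

Summing this value over all shards would give a byte count for the weights as stored on disk; whether that matches resident memory exactly after load is something the validation step of this epic would need to confirm.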
Deliverables (sub-issues)
- mlxcel inspect subcommand + generate/serve preflight (integration capstone)
Design decisions (confirmed with maintainer)
- CLI surface: both a dedicated read-only mlxcel inspect <model> subcommand AND an --estimate-memory preflight flag on generate/serve.
- Over-capacity behavior: abort load with a clear error, with a --force / --no-memory-check override.
- Scope: includes MLX runtime-memory FFI for ground-truth measurement and OOM-guarding limits.
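
The over-capacity behavior above could look roughly like the following sketch. The types `Preflight` and `OverCapacity` and the function `preflight` are hypothetical names, and the error wording is only an example of a "clear, actionable" message; `force` stands for either of the override flags.

```rust
use std::fmt;

/// Outcome of a successful preflight (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Preflight {
    Fits { estimated: u64, available: u64 },
    /// Over capacity, but the user asked to proceed anyway.
    ForcedOverCapacity { estimated: u64, available: u64 },
}

/// Error returned when the estimate exceeds available memory and no override is set.
#[derive(Debug, PartialEq, Eq)]
pub struct OverCapacity {
    pub estimated: u64,
    pub available: u64,
}

impl fmt::Display for OverCapacity {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
        write!(
            f,
            "estimated memory {:.1} GiB exceeds available {:.1} GiB; reduce the context \
             length, choose a smaller quantization, or pass --force / --no-memory-check",
            self.estimated as f64 / GIB,
            self.available as f64 / GIB
        )
    }
}

impl std::error::Error for OverCapacity {}

/// Abort when `estimated > available`, unless `force` is set.
pub fn preflight(estimated: u64, available: u64, force: bool) -> Result<Preflight, OverCapacity> {
    if estimated <= available {
        Ok(Preflight::Fits { estimated, available })
    } else if force {
        Ok(Preflight::ForcedOverCapacity { estimated, available })
    } else {
        Err(OverCapacity { estimated, available })
    }
}
```

Keeping the forced case as a distinct variant, rather than folding it into `Fits`, lets the caller still log a warning when the override is used.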
Epic acceptance criteria (must be functionally integrated, not standalone helpers)
- mlxcel inspect <model> [--max-tokens N] [--quant ...] [--cache-type-k/v ...] prints a breakdown without running generation: weights (exact from header, analytical fallback), KV per-token and at the requested context, activation/runtime headroom, total vs available unified memory, and a fits/doesn't-fit verdict.
- mlxcel generate/serve --estimate-memory runs the same estimate as a preflight and aborts with a clear, actionable error when total > available, unless --force/--no-memory-check is given.
- After a successful load, the estimate is validated against MLX get_active_memory() and the delta is logged; the activation-headroom factor is calibrated from real get_peak_memory() measurements (not a guess).
- The estimator is reachable from the single-device load path (not only the distributed/pipeline path), with the analytical ModelProfile retained as a fallback.
- All four sub-issues merged and integrated; the one estimator is wired into inspect, the preflight, and --recommend-quant (no duplicate logic, no orphaned modules).
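
The validation and calibration criterion could be approached along these lines. This is a sketch under stated assumptions: `Sample`, `signed_delta`, and `calibrate_headroom` are hypothetical names, the measured values are assumed to come from the yet-to-be-wrapped get_active_memory() / get_peak_memory() calls, and taking the largest observed overhead is one conservative policy among several.

```rust
/// One post-load observation (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    /// Estimated weights + KV bytes, before any headroom is applied.
    pub estimated_base: u64,
    /// Peak bytes reported by the runtime for the same run.
    pub measured_peak: u64,
}

/// Measured minus estimated; positive means the estimate was too low.
pub fn signed_delta(estimated: u64, measured: u64) -> i128 {
    measured as i128 - estimated as i128
}

/// Headroom factor chosen as the largest observed fractional overhead of
/// peak memory over the base estimate. Returns `None` without usable samples.
pub fn calibrate_headroom(samples: &[Sample]) -> Option<f64> {
    let mut worst: Option<f64> = None;
    for s in samples.iter().filter(|s| s.estimated_base > 0) {
        // Overhead below zero (peak under the estimate) is clamped to 0.
        let overhead = (s.measured_peak as f64 / s.estimated_base as f64 - 1.0).max(0.0);
        worst = Some(worst.map_or(overhead, |w| w.max(overhead)));
    }
    worst
}
```

A percentile or a per-architecture factor may turn out to be a better fit than the maximum once real measurements exist; the sketch only shows the shape of the computation.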