Epic: Pre-load model memory requirement estimation #52

@inureyes

Summary

mlxcel currently cannot tell a user, before or at model load, how much memory a given model will require at a given context length, so the first sign of an over-capacity model is an allocation failure mid-load. This epic adds an accurate, layered memory estimator (weights + KV cache + runtime/activation headroom), validates it against ground-truth MLX runtime memory, and surfaces it in two places: a dedicated mlxcel inspect subcommand, and an --estimate-memory preflight on generate/serve that aborts (with an override) when a model won't fit.

Background / current state

Memory is three layers with different predictability:

| Layer | Driven by | Predictability |
| --- | --- | --- |
| Weights | config (hidden/layers/vocab) + quant bits, or safetensors header | static, exact |
| KV cache | layers × kv_heads × head_dim × dtype × context_len × batch | exact once context fixed |
| Activation / MLX pool | seq_len, graph intermediates, Metal buffer pool | dynamic, needs runtime measurement |
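
The KV-cache row can be sketched as a small calculation. This is a minimal illustration, not mlxcel's implementation: the struct and field names are hypothetical, the factor of 2 assumes both a K and a V tensor are stored per layer, and per-group scale/bias overhead of quantized caches is ignored.

```rust
/// Hypothetical KV-cache shape; the names are illustrative, not mlxcel's.
#[derive(Debug, Clone, Copy)]
pub struct KvCacheShape {
    pub num_layers: u64,
    pub num_kv_heads: u64,
    pub head_dim: u64,
    /// Bits per stored element (16 for fp16/bf16, 8 for an 8-bit cache).
    pub bits_per_element: u64,
}

impl KvCacheShape {
    /// Bytes added per token: one K and one V vector per layer (assumed).
    pub fn bytes_per_token(&self) -> u64 {
        let elements = 2 * self.num_layers * self.num_kv_heads * self.head_dim;
        elements * self.bits_per_element / 8
    }

    /// Bytes for a fully populated cache at the given context length and batch size.
    pub fn total_bytes(&self, context_len: u64, batch: u64) -> u64 {
        self.bytes_per_token() * context_len * batch
    }
}
```

For an assumed 32-layer model with 8 KV heads, head_dim 128, and a 16-bit cache, this gives 128 KiB per token: 1 GiB at 8,192 tokens and 4 GiB at 32,768 tokens, which is why a context-independent allowance cannot cover both cases.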

What already exists (and its gaps):

  • estimate_model_params_billions() (src/execution/quant_advisor.rs:45): config → param count (B). Exposed via mlxcel generate --recommend-quant. Produces a count, not bytes.
  • build_model_profile() + ModelProfile::total_param_bytes() (src/distributed/pipeline/partition_profile.rs:64, src/distributed/pipeline/partition.rs:152): config + quantization.bits → bytes (few % accurate). Only used in the distributed/pipeline path, not in single-device load / CLI.
  • recommend_quantization() (src/lib/mlxcel-core/src/hardware.rs:426): already decides fit using ~2/1/0.5 bytes-per-param + a flat KV_CACHE_HEADROOM_GB = 2 constant vs total unified memory.
  • CachePool::memory_usage_bytes(): runtime KV usage (measurement, not estimate).
  • Hardware unified-memory total via sysctlbyname("hw.memsize") (src/lib/mlxcel-core/src/hardware.rs:256).
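
For reference, the flat heuristic described for recommend_quantization() amounts to roughly the following. This is a sketch reconstructed from the description above, not the code in hardware.rs: the function names are hypothetical, the mapping of 16/8/4-bit to ~2/1/0.5 bytes per parameter is an assumption, and everything is kept in decimal GB to match the constant.

```rust
/// Flat KV allowance, mirroring the KV_CACHE_HEADROOM_GB = 2 constant cited above.
const KV_CACHE_HEADROOM_GB: f64 = 2.0;

/// Approximate bytes per parameter. Mapping bit widths onto the
/// ~2/1/0.5 figures is an assumption made for this sketch.
pub fn bytes_per_param(quant_bits: u32) -> Option<f64> {
    match quant_bits {
        16 => Some(2.0),
        8 => Some(1.0),
        4 => Some(0.5),
        _ => None,
    }
}

/// Weights (params in billions x bytes per param = GB) plus the flat KV allowance.
pub fn flat_estimate_gb(params_billions: f64, quant_bits: u32) -> Option<f64> {
    Some(params_billions * bytes_per_param(quant_bits)? + KV_CACHE_HEADROOM_GB)
}

/// Fit decision against total unified memory, independent of context length.
pub fn fits(params_billions: f64, quant_bits: u32, total_memory_gb: f64) -> Option<bool> {
    Some(flat_estimate_gb(params_billions, quant_bits)? <= total_memory_gb)
}
```

Note that context length never enters the calculation, so the same verdict is returned for a 2K and a 128K context.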

Missing pieces this epic fills:

  • Byte-accurate weights from the safetensors header (the metadata.total_size in model.safetensors.index.json is parsed but discarded by parse_shard_index in src/lib/mlxcel-core/src/weights.rs).
  • A real KV-cache estimator (the flat 2 GB constant is wrong for long context / many layers).
  • MLX runtime-memory FFI (get_active_memory/get_peak_memory/set_memory_limit/set_cache_limit are not wrapped anywhere).
  • A unified estimator wired into the single-device CLI.
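
One way to get exact weight bytes without parsing any JSON relies on the safetensors layout: an 8-byte little-endian header length, then the JSON header, then the tensor data. The sketch below is an illustration only; the function names are hypothetical and it is not how parse_shard_index works today.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Tensor payload size given the file length and the 8-byte length prefix.
/// Returns None if the declared header would not fit in the file.
pub fn tensor_data_bytes(file_len: u64, prefix: [u8; 8]) -> Option<u64> {
    let header_len = u64::from_le_bytes(prefix);
    file_len.checked_sub(8)?.checked_sub(header_len)
}

/// Reads only the first 8 bytes of a shard; no tensor data is loaded.
pub fn shard_weight_bytes(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let file_len = file.metadata()?.len();
    let mut prefix = [0u8; 8];
    file.read_exact(&mut prefix)?;
    tensor_data_bytes(file_len, prefix).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::InvalidData,
            "safetensors header length exceeds file size",
        )
    })
}
```

Summing shard_weight_bytes over all shards gives the on-disk tensor payload; for sharded models the metadata.total_size field mentioned above could serve as a cross-check.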

Deliverables (sub-issues)

Design decisions (confirmed with maintainer)

  • CLI surface: both a dedicated read-only mlxcel inspect <model> subcommand AND an --estimate-memory preflight flag on generate/serve.
  • Over-capacity behavior: abort load with a clear error, with a --force / --no-memory-check override.
  • Scope: includes MLX runtime-memory FFI for ground-truth measurement and OOM-guarding limits.
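
The over-capacity behavior could look roughly like this. It is a sketch under assumptions: MemoryEstimate, Preflight, and preflight are hypothetical names, and the error wording is illustrative rather than the final CLI message.

```rust
/// Hypothetical three-layer estimate, all values in bytes.
#[derive(Debug, Clone, Copy)]
pub struct MemoryEstimate {
    pub weights_bytes: u64,
    pub kv_cache_bytes: u64,
    pub runtime_headroom_bytes: u64,
}

impl MemoryEstimate {
    pub fn total_bytes(&self) -> u64 {
        self.weights_bytes + self.kv_cache_bytes + self.runtime_headroom_bytes
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Preflight {
    /// Estimate is within available memory.
    Fits,
    /// Estimate exceeds available memory, but the user overrode the check.
    OverCapacityForced,
}

/// Returns Err with an actionable message when the model will not fit
/// and no override (--force / --no-memory-check) was given.
pub fn preflight(
    est: &MemoryEstimate,
    available_bytes: u64,
    force: bool,
) -> Result<Preflight, String> {
    let total = est.total_bytes();
    if total <= available_bytes {
        return Ok(Preflight::Fits);
    }
    if force {
        return Ok(Preflight::OverCapacityForced);
    }
    Err(format!(
        "estimated {} bytes (weights {}, KV cache {}, headroom {}) exceeds {} bytes available; \
         reduce --max-tokens, pick a smaller quantization, or pass --force",
        total,
        est.weights_bytes,
        est.kv_cache_bytes,
        est.runtime_headroom_bytes,
        available_bytes
    ))
}
```

Returning a distinct OverCapacityForced variant, rather than treating an override as a plain success, lets the caller still log a warning before loading.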

Epic acceptance criteria (must be functionally integrated, not standalone helpers)

  • mlxcel inspect <model> [--max-tokens N] [--quant ...] [--cache-type-k/v ...] prints a breakdown — weights (exact from header, analytical fallback), KV per-token and at the requested context, activation/runtime headroom, total vs available unified memory, fits/doesn't-fit — without running generation.
  • mlxcel generate/serve --estimate-memory runs the same estimate as a preflight and aborts with a clear, actionable error when total > available, unless --force/--no-memory-check is given.
  • After a successful load, the estimate is validated against MLX get_active_memory() and the delta is logged; the activation-headroom factor is calibrated from real get_peak_memory() measurements (not a guess).
  • The estimator is reachable from the single-device load path (not only the distributed/pipeline path), with the analytical ModelProfile retained as a fallback.
  • All four sub-issues merged and integrated; the one estimator is wired into inspect, the preflight, and --recommend-quant (no duplicate logic, no orphaned modules).
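
The post-load validation and headroom calibration could be expressed as below. Both functions are hypothetical, and the calibration rule (take the largest observed ratio of peak memory to static weights + KV bytes) is only one possible definition, not something this epic specifies. Measured values are passed in as plain numbers, so no MLX FFI signature is assumed.

```rust
/// Relative error of the estimate against a measurement,
/// e.g. a value read from MLX active memory after load.
/// Positive means the estimate was too low. None if the estimate is zero.
pub fn relative_error(estimated_bytes: u64, measured_bytes: u64) -> Option<f64> {
    if estimated_bytes == 0 {
        return None;
    }
    Some((measured_bytes as f64 - estimated_bytes as f64) / estimated_bytes as f64)
}

/// Hypothetical calibration rule: each sample is (static_bytes, peak_bytes),
/// where static_bytes = weights + KV cache. The factor is the largest
/// peak/static ratio seen, so it covers every measured run.
pub fn calibrate_headroom_factor(samples: &[(u64, u64)]) -> Option<f64> {
    samples
        .iter()
        .copied()
        .filter(|&(static_bytes, _)| static_bytes > 0)
        .map(|(static_bytes, peak_bytes)| peak_bytes as f64 / static_bytes as f64)
        .fold(None, |acc: Option<f64>, ratio| {
            Some(acc.map_or(ratio, |best| best.max(ratio)))
        })
}
```

Taking the maximum rather than the mean is a conservative choice: it biases the estimator toward refusing a load that might have fit rather than allowing one that fails mid-load.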

Labels: area:cli (Command-line interface / CLI flags), area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers), priority:medium (Medium priority), status:done (Completed), type:enhancement (New features, capabilities, or significant additions)