Epic: Pre-load model memory requirement estimation #52

## Summary
mlxcel cannot currently tell a user, before or at model load, how much memory a given model will require at a given context length, so the first sign of an over-capacity model is an allocation failure mid-load. This epic adds a layered memory estimator (weights + KV cache + runtime/activation headroom), validates it against ground-truth MLX runtime memory, and surfaces it in two places: a dedicated `mlxcel inspect` subcommand, and an `--estimate-memory` preflight on `generate`/`serve` that aborts (with an override) when a model won't fit.

## Background / current state
Memory use breaks down into three layers, each with different predictability:

| Layer | Driven by | Predictability |
|---|---|---|
| Weights | config (hidden/layers/vocab) + quant bits, or safetensors header | static; exact from the header, within a few % from config |
| KV cache | 2 (K and V) × layers × kv_heads × head_dim × dtype bytes × context_len × batch | exact once context fixed |
| Activation / MLX pool | seq_len, graph intermediates, Metal buffer pool | dynamic, needs runtime measurement |
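The KV-cache row can be turned into a small calculation. The following is a minimal sketch, not mlxcel code: the `KvCacheShape` struct and its field names are hypothetical, and it assumes a plain full-attention cache with a fixed element width. Note the factor of 2, since both keys and values are cached. Quantized caches that store per-group scales, or sliding-window layers, would need extra terms.

```rust
/// Hypothetical parameter bundle for illustration; not an mlxcel type.
#[derive(Debug, Clone, Copy)]
pub struct KvCacheShape {
    pub num_layers: u64,
    pub num_kv_heads: u64,
    pub head_dim: u64,
    /// Bytes per cached element: 2 for fp16/bf16, 1 for an 8-bit cache.
    pub bytes_per_element: u64,
}

impl KvCacheShape {
    /// Bytes needed to cache one token: keys and values across all layers.
    pub fn bytes_per_token(&self) -> u64 {
        2 * self.num_layers * self.num_kv_heads * self.head_dim * self.bytes_per_element
    }

    /// Bytes for a full cache holding `context_len` tokens for `batch` sequences.
    pub fn total_bytes(&self, context_len: u64, batch: u64) -> u64 {
        self.bytes_per_token() * context_len * batch
    }
}
```

For an assumed shape of 32 layers, 8 KV heads, head dimension 128, and fp16 elements, this gives 128 KiB per token: 1 GiB at 8,192 tokens and 16 GiB at 131,072 tokens. This is why the estimate has to depend on the requested context length.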

What already exists (and its gaps):
- `estimate_model_params_billions()` — `src/execution/quant_advisor.rs:45` — config → param count (B). Exposed via `mlxcel generate --recommend-quant`. Produces a *count*, not bytes.
- `build_model_profile()` + `ModelProfile::total_param_bytes()` — `src/distributed/pipeline/partition_profile.rs:64`, `src/distributed/pipeline/partition.rs:152` — config + `quantization.bits` → bytes (few % accurate). **Only used in the distributed/pipeline path**, not in single-device load / CLI.
- `recommend_quantization()` — `src/lib/mlxcel-core/src/hardware.rs:426` — already decides fit using ~2/1/0.5 bytes-per-param + a flat `KV_CACHE_HEADROOM_GB = 2` constant vs total unified memory.
- `CachePool::memory_usage_bytes()` — runtime KV usage (measurement, not estimate).
- Hardware unified-memory total via `sysctlbyname("hw.memsize")` — `src/lib/mlxcel-core/src/hardware.rs:256`.
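To make the gap concrete, here is a rough sketch of the kind of fit check described for `recommend_quantization()`. It is an illustration under stated assumptions, not the actual code in `hardware.rs`: the function name `fits_flat` and its signature are hypothetical, bytes per parameter are assumed to be `bits / 8`, and GB is treated as 10^9 bytes.

```rust
/// Flat headroom constant, as named in the issue text.
const KV_CACHE_HEADROOM_GB: f64 = 2.0;

/// Hypothetical helper: does a model fit under the flat heuristic?
/// `bits` of 16, 8, and 4 give roughly 2, 1, and 0.5 bytes per parameter.
pub fn fits_flat(params_billions: f64, bits: u32, total_memory_gb: f64) -> bool {
    let bytes_per_param = f64::from(bits) / 8.0;
    // 1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB.
    let weights_gb = params_billions * bytes_per_param;
    weights_gb + KV_CACHE_HEADROOM_GB <= total_memory_gb
}
```

Context length is not an input to this check at all, so a 4K-token run and a 128K-token run of the same model get the same verdict.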

Missing pieces this epic fills:
- Byte-accurate weights from the safetensors header. The `metadata.total_size` field in `model.safetensors.index.json` is parsed but discarded by `parse_shard_index` in `src/lib/mlxcel-core/src/weights.rs`.
- A real KV-cache estimator. The flat 2 GB constant is wrong for long contexts and for models with many layers.
- MLX runtime-memory FFI. `get_active_memory`/`get_peak_memory`/`set_memory_limit`/`set_cache_limit` are not wrapped anywhere.
- A unified estimator wired into the single-device CLI.
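For the weight-bytes piece, the safetensors file layout itself allows an exact count without reading any tensor data: a file starts with an 8-byte little-endian `u64` giving the JSON header length, followed by the header, followed by the raw tensor bytes. The sketch below uses only the standard library. The function names are hypothetical, and this is one possible per-shard cross-check, not necessarily the approach #53 will take.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Tensor-data bytes in a safetensors file, given its first 8 bytes and
/// its total length. Returns `None` if the declared header cannot fit.
pub fn tensor_data_bytes(prefix: [u8; 8], file_len: u64) -> Option<u64> {
    let header_len = u64::from_le_bytes(prefix);
    file_len.checked_sub(8)?.checked_sub(header_len)
}

/// Reads only the 8-byte prefix and the file metadata of one shard.
pub fn shard_weight_bytes(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let file_len = file.metadata()?.len();
    let mut prefix = [0u8; 8];
    file.read_exact(&mut prefix)?;
    tensor_data_bytes(prefix, file_len).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "header length exceeds file size")
    })
}
```

Summing `shard_weight_bytes` over all shards gives a figure that can be compared against `metadata.total_size` from the index file when one is present.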

## Deliverables (sub-issues)
- [ ] #53 — Exact weight-byte accounting from safetensors metadata
- [ ] #54 — KV-cache memory estimator (replaces the flat 2 GB constant)
- [ ] #55 — MLX runtime memory FFI wrapping (ground-truth measurement + limits)
- [ ] #56 — Unified estimator + `mlxcel inspect` subcommand + generate/serve preflight (integration capstone)

## Design decisions (confirmed with maintainer)
- CLI surface: **both** a dedicated read-only `mlxcel inspect <model>` subcommand AND an `--estimate-memory` preflight flag on `generate`/`serve`.
- Over-capacity behavior: **abort load with a clear error**, with a `--force` / `--no-memory-check` override.
- Scope: **includes** MLX runtime-memory FFI for ground-truth measurement and OOM-guarding limits.

## Epic acceptance criteria (must be functionally integrated, not standalone helpers)
- `mlxcel inspect <model> [--max-tokens N] [--quant ...] [--cache-type-k/v ...]` prints a breakdown — weights (exact from header, analytical fallback), KV per-token and at the requested context, activation/runtime headroom, total vs available unified memory, fits/doesn't-fit — without running generation.
- `mlxcel generate`/`serve --estimate-memory` runs the same estimate as a preflight and **aborts with a clear, actionable error** when total > available, unless `--force`/`--no-memory-check` is given.
- After a successful load, the estimate is validated against MLX `get_active_memory()` and the delta is logged; the activation-headroom factor is calibrated from real `get_peak_memory()` measurements (not a guess).
- The estimator is reachable from the single-device load path (not only the distributed/pipeline path), with the analytical `ModelProfile` retained as a fallback.
- All four sub-issues merged and integrated; the one estimator is wired into `inspect`, the preflight, and `--recommend-quant` (no duplicate logic, no orphaned modules).
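The preflight behavior in these criteria can be sketched as a single function over a layered estimate. Everything here is hypothetical and for illustration only: the `MemoryEstimate` and `OverCapacity` types, the `preflight` signature, and the error wording are assumptions rather than the design #56 will land on. The `force` parameter stands in for `--force`/`--no-memory-check`.

```rust
use std::fmt;

/// Hypothetical layered estimate: one field per layer in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryEstimate {
    pub weights_bytes: u64,
    pub kv_cache_bytes: u64,
    pub runtime_headroom_bytes: u64,
}

impl MemoryEstimate {
    /// Sum of all layers, saturating instead of overflowing.
    pub fn total_bytes(&self) -> u64 {
        self.weights_bytes
            .saturating_add(self.kv_cache_bytes)
            .saturating_add(self.runtime_headroom_bytes)
    }
}

/// Hypothetical error returned when the estimate exceeds available memory.
#[derive(Debug, PartialEq, Eq)]
pub struct OverCapacity {
    pub required_bytes: u64,
    pub available_bytes: u64,
}

impl fmt::Display for OverCapacity {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        const GIB: f64 = (1u64 << 30) as f64;
        write!(
            f,
            "model needs ~{:.1} GiB but only {:.1} GiB is available; \
             reduce the context length, pick a smaller quantization, or pass --force",
            self.required_bytes as f64 / GIB,
            self.available_bytes as f64 / GIB,
        )
    }
}

impl std::error::Error for OverCapacity {}

/// Aborts (returns `Err`) when the estimate does not fit, unless `force` is set.
pub fn preflight(
    est: &MemoryEstimate,
    available_bytes: u64,
    force: bool,
) -> Result<(), OverCapacity> {
    let required_bytes = est.total_bytes();
    if required_bytes > available_bytes && !force {
        return Err(OverCapacity { required_bytes, available_bytes });
    }
    Ok(())
}
```

Keeping the three layers as separate fields lets `inspect` print the breakdown and lets the preflight reuse the same total, which matches the "no duplicate logic" criterion.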
