linear_quantize_activations leaks ~1 temp .mlpackage + .mlmodelc per calibration op-group (tens of GB on realistic models)
Environment
- coremltools==9.0 (pip)
- macOS 15 (Darwin 25.0.0), Apple Silicon
- Python 3.12
Summary
During activation-statistics collection in
coremltools.optimize.coreml.experimental.linear_quantize_activations,
_ModelDebugger.predict_intermediate_outputs builds a fresh temporary
MLModel per op-group batch (loop at
coremltools/optimize/coreml/experimental/_model_debugger.py:315-320).
Each of those MLModel(spec, weights_dir=…) constructions allocates a
new .mlpackage under $TMPDIR via
_create_mlpackage → tempfile.mkdtemp(suffix=".mlpackage")
(coremltools/models/utils.py:114).
The only cleanup is the atexit.register(cleanup, self.package_path)
at coremltools/models/model.py:569, so nothing is reclaimed until
the process exits. The compiled .mlmodelc that CoreML creates when
the model is loaded (skip_model_load=False, which is the default for
calibration) is likewise retained.
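The lifetime pattern at fault can be shown standalone. This is a simplified sketch of the atexit-only cleanup pattern, not the actual coremltools code: every construction registers a process-exit hook, so directories accumulate for the life of the interpreter.

```python
import atexit
import os
import shutil
import tempfile

def make_temp_package():
    # Mirrors the pattern in _create_mlpackage + MLModel.__init__:
    # create a temp dir, defer cleanup to interpreter exit.
    path = tempfile.mkdtemp(suffix=".mlpackage")
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path

paths = [make_temp_package() for _ in range(5)]
# All five directories still exist mid-process; nothing is reclaimed
# until exit, no matter how many times this runs in a loop.
print(all(os.path.isdir(p) for p in paths))
```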
Measured on a medium-sized transformer (Gemma 4 E2B chunk2, ~300 MB
mlpackage): a single calibration run with 8 calibration samples
accumulates ~80 temp directories totaling ~38 GB in $TMPDIR
before the process exits. With 49 GB free on the dev disk, the run
fills the partition and the calibration aborts.
Confirmed on a minimal self-contained reproducer (reproduce_leak.py,
attached) — 4 calibration samples against an 8-layer 2048-dim
nn.Linear stack leaks +6 temporary .mlpackage dirs, +402.9 MB:
$ TMPDIR=/tmp/leak-test python reproduce_leak.py
...
Before: 0 mlpackage, 0 mlmodelc, 0.0 MB
Running linear_quantize_activations (activation calibration) ...
After: 6 mlpackage, 0 mlmodelc, 402.9 MB
Leaked: +6 mlpackage, +0 mlmodelc, +402.9 MB
The leak scales linearly with calibration samples × op-group batches ×
model size. On Gemma 4 chunk2 (300 MB × ~128 batches) this scaling
matches the ~38 GB figure above.
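The back-of-the-envelope arithmetic, using the numbers from this report:

```python
# ~300 MB per temp .mlpackage copy, ~128 op-group batches on chunk2.
package_mb = 300
batches = 128
total_gb = package_mb * batches / 1024
print(round(total_gb, 1))  # 37.5 -> consistent with the ~38 GB observed
```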
With the proposed MLModel.__del__ / close() fix applied (patch
attached), exactly the same reproducer shows:
After: 0 mlpackage, 0 mlmodelc, 0.0 MB
Leaked: +0 mlpackage, +0 mlmodelc, +0.0 MB
Reproducer
Self-contained, no external weights needed (attached as
reproduce_leak.py):
import numpy as np

import coremltools as ct
from coremltools.optimize.coreml.experimental import (
    OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_activations,
)

# … build a small MLModel from a torch.nn.Linear stack …
quantized = linear_quantize_activations(
    mlmodel,
    OptimizationConfig(global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric", dtype="int8")),
    sample_data=[{"x": np.random.randn(1, 2048).astype(np.float32)} for _ in range(4)],
)
del quantized
# $TMPDIR now contains leftover tmp*.mlpackage and tmp*.mlmodelc directories.
Expected output (counts and bytes retained) is printed by the script
before and after the call.
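The before/after measurement can be sketched as follows. The helper name `tmpdir_stats` is illustrative, not necessarily what the attached script calls it; it counts leftover `.mlpackage`/`.mlmodelc` directories under `$TMPDIR` and totals their size.

```python
import os
import tempfile

def tmpdir_stats():
    # Count leaked temp packages and their total size in MB.
    root = tempfile.gettempdir()
    entries = os.listdir(root)
    pkgs = [d for d in entries if d.endswith(".mlpackage")]
    mlcs = [d for d in entries if d.endswith(".mlmodelc")]
    total = 0
    for d in pkgs + mlcs:
        for dirpath, _, files in os.walk(os.path.join(root, d)):
            for f in files:
                p = os.path.join(dirpath, f)
                if os.path.isfile(p):
                    total += os.path.getsize(p)
    return len(pkgs), len(mlcs), total / 2**20

before = tmpdir_stats()
# ... run linear_quantize_activations here ...
after = tmpdir_stats()
```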
Expected vs observed
- Expected: each temporary
MLModel used only for prediction
during calibration releases its backing directory when it goes out
of scope.
- Observed: the backing directory survives until the Python
process exits.
Root cause
MLModel.__init__ ties the temp directory lifetime to an atexit
hook instead of the Python object's lifetime. There is no
__del__ / close() path. For long-running calibration loops this
is exactly the wrong lifetime.
Proposed fix
Two options, both small:
- Per-call site — wrap the MLModel in
  _model_debugger.predict_intermediate_outputs with a try/finally
  that rmtrees model.package_path (and drops model.__proxy__)
  after each iteration. Minimal surface; only touches the
  quantization path that actually hits the problem.
- MLModel-level — add MLModel.close() / __del__ that releases
  the temp package eagerly. Fixes every caller that constructs
  MLModel from a spec (ct.optimize helpers, ct.convert internals,
  custom pipelines). The atexit hook stays as a safety net.
Patch for option 2 attached (fix_mlmodel_tmp_leak.patch).
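The shape of option 2, as a standalone sketch (the class below is a stand-in with assumed attribute names; the attached patch against coremltools.models.MLModel is the authoritative version):

```python
import os
import shutil
import tempfile

class _TempPackageModel:  # stand-in for coremltools.models.MLModel
    """Tie the temp .mlpackage lifetime to the Python object."""

    def __init__(self, package_path, is_temp_package=True):
        self.package_path = package_path
        self.is_temp_package = is_temp_package

    def close(self):
        # Eagerly delete the backing temp package; safe to call twice.
        # The existing atexit hook remains as a safety net for objects
        # that are never closed or collected.
        if self.is_temp_package and self.package_path is not None:
            shutil.rmtree(self.package_path, ignore_errors=True)
            self.package_path = None

    def __del__(self):
        self.close()

# Demo: the directory disappears on close(), not at interpreter exit.
tmp_pkg = tempfile.mkdtemp(suffix=".mlpackage")
model = _TempPackageModel(tmp_pkg)
model.close()
print(os.path.isdir(tmp_pkg))  # False
```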
Happy to open a PR with a regression test that asserts $TMPDIR count
does not grow across a calibration loop, if that would be useful.