Fix excessive memory allocation for static-shape attention ops #2636
Pranaykarvi wants to merge 1 commit into apple:main
Conversation
Hi @TobyRoseman, just a gentle follow-up in case this slipped through.
TobyRoseman left a comment
Your new unit tests don't pass.
| from coremltools.converters.mil._deployment_compatibility import AvailableTarget as target
| from coremltools.converters.mil.mil import Builder as mb
| from coremltools.converters.mil.mil import types
| from coremltools.converters.mil.mil.types.symbolic import is_symbolic
I don't think you need this line.
| # allocation issues as static shapes, so the higher threshold is appropriate.
| logger.debug(
| f"skipping SDPA op, Q seq_length is {q_seq_length} (minimum seq length needed: {self._min_seq_length}"
| f"skipping SDPA op, Q seq_length is dynamic (symbolic), "
This shouldn't be an f-string since there is no variable being used.
Looks like this is also an issue in several other places of this PR.
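For example, the quoted line could simply drop the f prefix (a minimal sketch of the suggested fix; the continuation text of the message is assumed):

```python
# Plain string literals, since nothing is interpolated here.
logger.debug(
    "skipping SDPA op, Q seq_length is dynamic (symbolic), "
    "so slicing is not applied"
)
```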
| "common::remove_symbolic_reshape", | ||
| "common::noop_elimination", | ||
| # Apply attention slicing early to reduce memory allocation for static sequence lengths. | ||
| # This pass replaces scaled_dot_product_attention with a memory-efficient sliced implementation. |
Remove this line of the comment. It doesn't really add much and can easily become outdated/inaccurate.
| Defines the size of the chunks of Q being processed in SDPA (chunk_size = seq_length / seq_length_divider)
| """
|
| # Default threshold for dynamic-shape models. Dynamic shapes use runtime allocation
The added comments in this file are far too long. They need to be much more concise.
| 3. The model can be converted successfully
| """
| # Create a minimal transformer attention block with static seq_len=128
| batch_size = 1
These are constants? If so, the variable names should be all caps.
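For example (a minimal sketch; only batch_size appears in the quoted diff, and SEQ_LEN's value is assumed from the test's static seq_len=128 setup):

```python
# Test constants named in ALL_CAPS.
BATCH_SIZE = 1
SEQ_LEN = 128
```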
| return output
|
| # Apply the default pass pipeline which includes the slicing pass
| from coremltools.converters.mil.mil.passes.pass_pipeline import PassPipeline
Import statements should be at the top of the file. If for some reason they can't be at the top of the file, do them at the top of the function.
Looks like this is also an issue elsewhere.
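For example, the import from the quoted diff could move up next to the other module-level imports (a minimal sketch):

```python
# Top of the test file, alongside the existing coremltools imports.
from coremltools.converters.mil.mil import Builder as mb
from coremltools.converters.mil.mil.passes.pass_pipeline import PassPipeline
```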
| f"This indicates the memory allocation fix is not working correctly." | ||
| ) | ||
|
|
||
| # Verify the program structure is correct |
Did you mean to delete this comment?
| """ | ||
| Regression test for memory allocation bug with static sequence length transformers. | ||
|
|
||
| This test verifies that exporting a Llama-style Transformer with a static sequence |
I might be wrong, but I don't think any of this is specific to a Llama-style Transformer.
| # The key verification is that attention ops are sliced and tensor sizes are reasonable
| # which we've already checked above
|
| def test_static_seq_len_128_with_quantization(self):
There is a lot of duplicated code here with the previous method. Please create a helper function.
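One possible shape for that refactor (a sketch only; the helper name, its parameters, and _build_attention_prog / _apply_default_pipeline are hypothetical stand-ins for the duplicated setup code):

```python
def _convert_with_slicing(self, seq_len=128, quantize=False):
    # Hypothetical shared helper: build the static-shape attention program,
    # run the default pass pipeline over it, and return the transformed program
    # so each test method keeps only the assertions that differ.
    prog = self._build_attention_prog(seq_len=seq_len, quantize=quantize)
    self._apply_default_pipeline(prog)
    return prog
```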
Summary
This PR fixes excessive memory allocation for Transformer attention ops when the
sequence length is statically known at compile time (e.g. seq_len=128).

For static-shape attention, Metal may eagerly allocate large intermediate buffers
(e.g. QKᵀ matrices), which can lead to multi-GB allocations and OOM on iOS devices.
The existing attention slicing pass was gated behind a high sequence-length
threshold and did not trigger for smaller static shapes.
This change enables memory-efficient attention slicing for static sequence lengths
while preserving the existing behavior for dynamic-shape models.
Problem
When exporting Transformer models with a statically known sequence length,
scaled_dot_product_attention may materialize large intermediate tensors during lowering. On iOS, this can result in excessive Metal buffer allocation (observed ~10GB) and OOM during inference or benchmarking, even for relatively small models (e.g. Llama-style models with seq_len=128).

Solution

- Applies the scaled_dot_product_attention_sliced_q pass to break the computation into smaller chunks and reduce peak memory usage.
- Preserves the existing default threshold (1280) and behavior for dynamic-shape models to avoid unnecessary overhead.
This approach limits the change to the pathological static-shape case and avoids
global behavior changes.
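As a point of reference, the sliced SDPA pass can already be opted into through the public pass-pipeline API. A minimal sketch (assuming a traced PyTorch model `traced_model` and its `inputs`; the option names follow the quoted diff, and the exact values shown are illustrative):

```python
import coremltools as ct

# Opt into chunked SDPA with a lower threshold so a static seq_len=128 qualifies.
pipeline = ct.PassPipeline.DEFAULT
pipeline.append_pass("common::scaled_dot_product_attention_sliced_q")
pipeline.set_options(
    "common::scaled_dot_product_attention_sliced_q",
    {"min_seq_length": 128, "seq_length_divider": 16},
)

mlmodel = ct.convert(traced_model, inputs=inputs, pass_pipeline=pipeline)
```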
Testing
- Added unit tests that convert a transformer attention block with a static sequence length (seq_len=128).
- The tests verify that attention ops are sliced and intermediate tensor sizes stay bounded, avoiding pathological buffer materialization.
Notes
eager buffer allocation.
Fixes #2590.