feat: --aggressive-offload for Apple Silicon (MPS)#13367
uxtechie wants to merge 4 commits into `Comfy-Org:master`
Conversation
Eliminate swap pressure on unified memory systems by:
- Force-destroying model parameters via the meta device after use
- Flushing the MPS allocator cache per sampling step
- Preserving small models (<1 GB, e.g. VAE) via a size threshold
- A lifecycle callback system for execution cache invalidation

Benchmarked on an M5 Pro 48 GB with FLUX.2 Dev 32B GGUF:
- Latency: 50 min → 20 min per image (2.5× improvement)
- Stability: 4+ consecutive generations without OOM
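The size-threshold rule described above can be sketched in isolation. This is an illustrative sketch only: `should_destroy`, `model_size_bytes`, and `DESTROY_THRESHOLD_BYTES` are hypothetical names, not the PR's actual API.

```python
# Illustrative sketch of the >1 GB destruction threshold; names are
# hypothetical, not the PR's actual API.
DESTROY_THRESHOLD_BYTES = 1024 * 1024 * 1024  # 1 GB


def model_size_bytes(parameters):
    """Sum numel * element_size over parameter-like objects."""
    return sum(p.numel() * p.element_size() for p in parameters)


def should_destroy(parameters, aggressive_offload):
    # Small models (e.g. a VAE) stay resident so cached decode paths
    # never touch a destroyed (meta-device) tensor.
    if not aggressive_offload:
        return False
    return model_size_bytes(parameters) > DESTROY_THRESHOLD_BYTES
```

The threshold turns destruction into a cheap metadata check, which is why the VAE survives while multi-gigabyte UNET/CLIP weights are reclaimed.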
> Note: Reviews paused. This branch appears to be under active development; to avoid overwhelming the author with review comments during an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable.
📝 Walkthrough: Adds a CLI flag.
🚥 Pre-merge checks: 2 passed, 1 failed (1 warning).
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_execution/caching.py`:
- Around line 193-202: NullCache is missing the public invalidation API
clear_all(), violating the cache contract used by CacheSet.init_null_cache and
consumers like PromptExecutor; add a no-op clear_all(self) method to the
NullCache class so callers can unconditionally call cache.clear_all() without
AttributeError (implement as an empty method that does not raise and document it
as a no-op invalidation for null backend).
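A minimal sketch of the requested fix: `NullCache` gains a no-op `clear_all()` so consumers can invoke `cache.clear_all()` unconditionally. The surrounding `get`/`set` methods here are illustrative; the real class lives in `comfy_execution/caching.py`.

```python
# Sketch: NullCache with the no-op clear_all() the review asks for.
# get/set bodies are illustrative stand-ins for the real null backend.
class NullCache:
    """Cache backend that stores nothing."""

    def get(self, key):
        return None

    def set(self, key, value):
        pass

    def clear_all(self):
        """No-op invalidation: a null backend holds nothing to clear."""
        pass
```

With this in place, `CacheSet` consumers such as `PromptExecutor` never need an `isinstance` check before invalidating.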
In `@comfy/model_management.py`:
- Around line 684-688: The current code unconditionally sets EXTRA_RESERVED_VRAM
to 4GB when CPUState.MPS is detected, which changes default MPS memory
heuristics; change this so the 4GB reservation is only applied when the explicit
opt-in flag is set (e.g., AGGRESSIVE_OFFLOAD or a dedicated MPS flag). Update
the CPUState.MPS branch to check the boolean AGGRESSIVE_OFFLOAD (or an
equivalent MPS-specific flag) before assigning EXTRA_RESERVED_VRAM = 4 * 1024 *
1024 * 1024 and before emitting the "MPS detected: reserving 4 GB..." log,
leaving the default path unchanged when the flag is false; ensure you reference
CPUState.MPS, EXTRA_RESERVED_VRAM, and AGGRESSIVE_OFFLOAD in the change so the
behavior is gated behind the opt-in.
- Around line 490-511: The current register_model_destroyed_callback appends
strong references to _on_model_destroyed_callbacks causing previous
PromptExecutor instances (registered via PromptExecutor.__init__) to be
retained; change this to store weak references: use weakref.WeakSet/weakref.ref
for plain callables and weakref.WeakMethod for bound methods, and update the
invocation logic that iterates _on_model_destroyed_callbacks to dereference weak
refs and prune dead entries before calling. Also add an
unregister_model_destroyed_callback(callback) that locates and removes the
matching weakref (or dead entries) so executors can unregister on teardown;
ensure functions that invoked register_model_destroyed_callback are updated to
call unregister during cleanup.
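One way to realize the weak-reference registry the review describes is sketched below; the function and list names mirror the review text, but the implementation details are assumptions, not the PR's code.

```python
import gc
import weakref

# Sketch of a weak-reference callback registry so registered executors
# can be garbage-collected; names follow the review text.
_on_model_destroyed_callbacks = []


def register_model_destroyed_callback(callback):
    # WeakMethod for bound methods (e.g. PromptExecutor methods),
    # plain weakref.ref for free functions.
    ref_type = weakref.WeakMethod if hasattr(callback, "__self__") else weakref.ref
    _on_model_destroyed_callbacks.append(ref_type(callback))


def unregister_model_destroyed_callback(callback):
    # Remove the matching weakref and prune any dead entries found.
    for r in list(_on_model_destroyed_callbacks):
        target = r()
        if target is None or target == callback:
            _on_model_destroyed_callbacks.remove(r)


def notify_model_destroyed(reason):
    # Dereference weak refs, prune dead entries, then invoke survivors.
    live = [r for r in _on_model_destroyed_callbacks if r() is not None]
    _on_model_destroyed_callbacks[:] = live
    for r in live:
        cb = r()
        if cb is not None:
            cb(reason)
```

Because only weak references are held, dropping the last strong reference to an executor is enough to retire its callback even if `unregister` is never called.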
In `@tests-unit/test_aggressive_offload.py`:
- Around line 178-180: The test test_large_model_above_threshold currently
instantiates FakeLinearModel(size_mb=2048) which allocates a huge real tensor;
change the test to avoid allocating backing storage by stubbing or mocking the
reported parameter size instead: either modify FakeLinearModel to accept a flag
(e.g., allocate=False) that only sets reported size metadata without allocating
tensors, or replace the instantiation in test_large_model_above_threshold with a
lightweight fake/mock object that implements the same size-reporting interface
used by the code under test (e.g., .parameters(), .numel(), or a size_mb
property) so the test exercises the destruction-threshold logic without creating
large memory buffers.
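A torch-free sketch of the suggested non-allocating fake: a parameter stub reports `numel()`/`element_size()` without any backing buffer. `FakeParam` and this `FakeLinearModel` variant are illustrative, not the test file's actual code.

```python
# Sketch: size metadata without backing storage, so a "2 GB" model
# costs almost nothing in the test process.
class FakeParam:
    def __init__(self, numel, element_size=4):
        self._numel = numel
        self._element_size = element_size

    def numel(self):
        return self._numel

    def element_size(self):
        return self._element_size


class FakeLinearModel:
    def __init__(self, size_mb=2.0):
        # Each float32 param is 4 bytes, so n params ≈ size_mb * 1024² / 4,
        # but only the count is stored — nothing is allocated.
        n = int(size_mb * 1024 * 1024 / 4)
        self._params = [FakeParam(n)]

    def parameters(self):
        return iter(self._params)
```

The code under test only consumes the size-reporting interface, so the threshold logic is exercised identically.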
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 93e9ffdf-231b-4934-a488-77ed0e235f08
📒 Files selected for processing (6)
- comfy/cli_args.py
- comfy/model_management.py
- comfy/samplers.py
- comfy_execution/caching.py
- execution.py
- tests-unit/test_aggressive_offload.py
Force-pushed: 07c3887 → d9a4089
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests-unit/test_aggressive_offload.py (1)
185-194: ⚠️ Potential issue | 🟠 Major — Avoid allocating a real 2 GB tensor in this unit test.
Line 187 creates `FakeLinearModel(size_mb=2048)`, which materializes a huge backing tensor via line 24. This can OOM/flake CI and turns a logic test into a machine-capacity test.
💡 Suggested change
```diff
 class FakeLinearModel(nn.Module):
@@
-    def __init__(self, size_mb: float = 2.0):
+    def __init__(self, size_mb: float = 2.0, *, allocate: bool = True):
         super().__init__()
         # Each float32 param = 4 bytes, so `n` params ≈ size_mb * 1024² / 4
         n = int(size_mb * 1024 * 1024 / 4)
-        self.weight = nn.Parameter(torch.zeros(n, dtype=torch.float32))
+        if allocate:
+            self.weight = nn.Parameter(torch.zeros(n, dtype=torch.float32))
+        else:
+            self.weight = nn.Parameter(torch.zeros(1, dtype=torch.float32))
+        self.reported_size_bytes = int(size_mb * 1024 * 1024)
@@
     def test_large_model_above_threshold(self):
         """A 2 GB model (UNET/CLIP-sized) must BE above the destruction threshold."""
-        model = FakeLinearModel(size_mb=2048)
-
-        model_size = sum(p.numel() * p.element_size() for p in model.parameters())
+        model = FakeLinearModel(size_mb=2048, allocate=False)
+        model_size = model.reported_size_bytes
         threshold = 1024 * 1024 * 1024  # 1 GB
```

```bash
#!/bin/bash
# Verify whether the large-threshold test still allocates real backing storage.
rg -n -C2 'def __init__\(self, size_mb|torch\.zeros\(n|test_large_model_above_threshold|size_mb=2048' tests-unit/test_aggressive_offload.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/test_aggressive_offload.py` around lines 185 - 194, The test test_large_model_above_threshold currently constructs FakeLinearModel(size_mb=2048) which allocates a real huge tensor; change FakeLinearModel (constructor / __init__) so it no longer materializes a backing buffer for large sizes: compute the intended numel from size_mb and return lightweight fake parameter objects for parameters() (or a small torch.nn.Parameter) whose numel() and element_size() methods return the computed values, or add an allocate=False flag and skip creating the large tensor when false; update test_large_model_above_threshold to use the non-allocating behavior so the assertion on model_size still computes correctly without allocating gigabytes.
🧹 Nitpick comments (1)
tests-unit/test_aggressive_offload.py (1)
217-242: These MPS flush tests don't currently validate sampler flush behavior.
Lines 217-242 only assert local boolean/device facts; they can pass even if the sampler's MPS `empty_cache()` call path regresses. Consider exercising the actual wrapper/step path with a mocked `torch.mps.empty_cache` call count.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/test_aggressive_offload.py` around lines 217 - 242, Replace the passive asserts with a test that actually exercises the sampler flush path: patch torch.mps.empty_cache (unittest.mock.patch) and then run the sampler entrypoint that would trigger a flush (the sampler wrapper/step used by your codebase — e.g., the samplers module entry function referenced as "samplers" in the comment) while toggling comfy.model_management.AGGRESSIVE_OFFLOAD and simulating an MPS device; assert that torch.mps.empty_cache was called when mm.AGGRESSIVE_OFFLOAD is True and the sampler/device reports type "mps", and assert it was not called when AGGRESSIVE_OFFLOAD is False or device.type != "mps". Ensure you set mm.AGGRESSIVE_OFFLOAD back to its original value in a finally block and restore any patched device attributes.
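The suggestion above can be sketched torch-free: a flush helper that calls `empty_cache` only when the flag is set and the device is MPS. The helper and its injected `empty_cache` argument are illustrative stand-ins for the sampler's call to `torch.mps.empty_cache`, not the repository's code.

```python
from unittest.mock import MagicMock

# Sketch: conditional flush plus a mock-based call-count check,
# standing in for patching torch.mps.empty_cache in the real test.
def flush_mps_cache(device_type, aggressive_offload, empty_cache):
    if aggressive_offload and device_type == "mps":
        empty_cache()


mock_empty = MagicMock()
flush_mps_cache("mps", True, mock_empty)    # should flush
flush_mps_cache("mps", False, mock_empty)   # flag off: no flush
flush_mps_cache("cuda", True, mock_empty)   # wrong device: no flush
```

Asserting on the mock's `call_count` makes the test fail if the flush path regresses, which the current boolean checks cannot.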
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests-unit/test_aggressive_offload.py`:
- Around line 257-261: Update the test test_flag_defaults_from_cli_args to
assert the actual wiring instead of just attribute presence: import
comfy.model_management and assert that comfy.model_management.AGGRESSIVE_OFFLOAD
== comfy.cli_args.args.aggressive_offload so the test fails if the CLI flag is
not propagated to the model_management constant (keep the existing imports and
names: test_flag_defaults_from_cli_args, comfy.cli_args.args.aggressive_offload,
comfy.model_management.AGGRESSIVE_OFFLOAD).
---
Duplicate comments:
In `@tests-unit/test_aggressive_offload.py`:
- Around line 185-194: The test test_large_model_above_threshold currently
constructs FakeLinearModel(size_mb=2048) which allocates a real huge tensor;
change FakeLinearModel (constructor / __init__) so it no longer materializes a
backing buffer for large sizes: compute the intended numel from size_mb and
return lightweight fake parameter objects for parameters() (or a small
torch.nn.Parameter) whose numel() and element_size() methods return the computed
values, or add an allocate=False flag and skip creating the large tensor when
false; update test_large_model_above_threshold to use the non-allocating
behavior so the assertion on model_size still computes correctly without
allocating gigabytes.
---
Nitpick comments:
In `@tests-unit/test_aggressive_offload.py`:
- Around line 217-242: Replace the passive asserts with a test that actually
exercises the sampler flush path: patch torch.mps.empty_cache
(unittest.mock.patch) and then run the sampler entrypoint that would trigger a
flush (the sampler wrapper/step used by your codebase — e.g., the samplers
module entry function referenced as "samplers" in the comment) while toggling
comfy.model_management.AGGRESSIVE_OFFLOAD and simulating an MPS device; assert
that torch.mps.empty_cache was called when mm.AGGRESSIVE_OFFLOAD is True and the
sampler/device reports type "mps", and assert it was not called when
AGGRESSIVE_OFFLOAD is False or device.type != "mps". Ensure you set
mm.AGGRESSIVE_OFFLOAD back to its original value in a finally block and restore
any patched device attributes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 36f37e85-b5b0-490e-b4a5-0d1b6da94b80
📒 Files selected for processing (2)
- comfy_execution/caching.py
- tests-unit/test_aggressive_offload.py
🚧 Files skipped from review as they are similar to previous changes (1)
- comfy_execution/caching.py
Force-pushed: d9a4089 → c63c5bc
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_execution/caching.py`:
- Around line 193-203: clear_all() currently clears self.cache and
self.subcaches but leaves LRU/pressure metadata (e.g. used_generation, children,
timestamps) on LRUCache/RAMPressureCache instances, causing stale bookkeeping;
modify comfy_execution.caching.clear_all to iterate over existing cache and
subcache objects (instances of LRUCache and RAMPressureCache) and reset their
internal metadata fields (set used_generation to 0 or appropriate baseline,
clear children collections, clear timestamps structures and any other per-entry
bookkeeping) before or after clearing self.cache/self.subcaches so eviction
behavior is fully reset; reference the LRUCache and RAMPressureCache classes and
their attributes used_generation, children, timestamps when making the changes.
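A sketch of a `clear_all()` that also resets eviction bookkeeping follows; the attribute names (`used_generation`, `children`, `timestamps`) come from the review text and are assumptions about the real `LRUCache`/`RAMPressureCache` internals.

```python
# Sketch: full invalidation that resets LRU/pressure metadata alongside
# the cache contents; attribute names follow the review text.
class LRUCacheSketch:
    def __init__(self):
        self.cache = {}
        self.subcaches = {}
        self.used_generation = {}
        self.children = {}
        self.timestamps = {}

    def clear_all(self):
        self.cache.clear()
        self.subcaches.clear()
        # Without these resets, stale generation counters and timestamps
        # would keep influencing eviction after an invalidation.
        self.used_generation.clear()
        self.children.clear()
        self.timestamps.clear()
```

The point is that invalidation must restore the cache to its post-construction baseline, not just empty its value store.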
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ac0b6276-9896-4835-b713-0c5f31568390
📒 Files selected for processing (3)
- comfy/model_management.py
- comfy_execution/caching.py
- tests-unit/test_aggressive_offload.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests-unit/test_aggressive_offload.py
- comfy/model_management.py
Force-pushed: ee9abc9 → ffcc43a
🧹 Nitpick comments (1)
comfy/model_management.py (1)
776-802: Consider wrapping individual callback invocations in try/except.
The two-phase design (deferred meta move + single batch notification) is well-structured. However, if a registered callback raises an exception at line 801, subsequent callbacks won't execute. While the current sole consumer (PromptExecutor) is unlikely to throw, defensive handling would improve resilience as more consumers are added.
🛡️ Optional: Defensive callback invocation

```diff
 if _meta_destroy_queue and _on_model_destroyed_callbacks:
     for cb in _on_model_destroyed_callbacks:
-        cb("batch")
+        try:
+            cb("batch")
+        except Exception as e:
+            logging.warning(f"[aggressive-offload] Callback failed: {e}")
     logging.info(f"[aggressive-offload] Invalidated execution cache after destroying {len(_meta_destroy_queue)} model(s)")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/model_management.py` around lines 776 - 802, The loop invoking registered callbacks (_on_model_destroyed_callbacks) should guard each callback call so one failing callback doesn't stop the rest; change the for cb in _on_model_destroyed_callbacks: cb("batch") to call each cb("batch") inside a try/except that catches Exception, logs the failure (e.g., logging.exception or logging.warning with the exception) and then continues, ensuring all callbacks (including PromptExecutor) still run and the final logging.info still executes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@comfy/model_management.py`:
- Around line 776-802: The loop invoking registered callbacks
(_on_model_destroyed_callbacks) should guard each callback call so one failing
callback doesn't stop the rest; change the for cb in
_on_model_destroyed_callbacks: cb("batch") to call each cb("batch") inside a
try/except that catches Exception, logs the failure (e.g., logging.exception or
logging.warning with the exception) and then continues, ensuring all callbacks
(including PromptExecutor) still run and the final logging.info still executes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 328d2ed0-f541-4917-9626-7a126746a09d
📒 Files selected for processing (3)
- comfy/model_management.py
- comfy_execution/caching.py
- tests-unit/test_aggressive_offload.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy_execution/caching.py
- tests-unit/test_aggressive_offload.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/model_management.py`:
- Around line 750-758: The aggressive-offload block can dereference a dead
weakref; before accessing
current_loaded_models[i].model.model.__class__.__name__ or calling
model_memory(), first retrieve the model reference into a local (e.g., model_ref
= current_loaded_models[i].model), check that model_ref is not None and that
getattr(model_ref, "model", None) is not None (and that currently_used still
False), and only then compute model_size_mb and log; if the weakref is dead,
skip/continue the offload branch to avoid AttributeError. Ensure checks occur
inside the AGGRESSIVE_OFFLOAD and VRAMState.SHARED branch and reference the same
current_loaded_models[i] entry consistently.
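The guard the review asks for can be sketched as below; the `LoadedModel` shape (`loaded.model`, then `.model` on that) follows the review text and is an assumption about the real structures in `comfy/model_management.py`.

```python
# Sketch: guard against a dead weakref before sizing/logging a model
# in the aggressive-offload branch.
def maybe_queue_for_destruction(loaded, aggressive_offload, shared_vram, queue):
    if not (aggressive_offload and shared_vram):
        return
    model_ref = loaded.model  # may already be None if the weakref died
    if model_ref is None or getattr(model_ref, "model", None) is None:
        return  # skip this entry instead of raising AttributeError
    inner = model_ref.model
    size_mb = sum(p.numel() * p.element_size()
                  for p in inner.parameters()) / (1024 * 1024)
    queue.append((inner.__class__.__name__, size_mb))
```

Pulling the reference into a local once also avoids time-of-check/time-of-use races against the same `current_loaded_models[i]` entry.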
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 00b934cf-fe05-4da1-8e14-d54258db5827
📒 Files selected for processing (3)
- comfy/model_management.py
- comfy_execution/caching.py
- tests-unit/test_aggressive_offload.py
Force-pushed: 72977bb → c2213f8
Force-pushed: c2213f8 → 7ec3984
Force-pushed: 162b2ad → 0bd2a35
Aggressive Offload Benchmark — Apple Silicon (MPS)
Summary
Add a `--aggressive-offload` CLI flag for Apple Silicon (MPS) systems with unified memory. It forces model parameter destruction and an MPS allocator flush between generation runs, eliminating swap pressure on disk.
Problem
On Apple Silicon, CPU RAM = GPU VRAM (unified memory). ComfyUI's default
memory management assumes models can be offloaded to CPU — but on MPS,
"offloaded" models still consume the same physical RAM. When running
large pipelines (e.g. FLUX.2 Dev 32B GGUF + Mistral 24B GGUF = ~42 GB), the
MPS allocator fragments the shared pool, forcing macOS to swap model
weights to SSD. This turns GPU compute into disk-bound I/O.
Hardware
Model Configuration
Generation Parameters
Results
Single Generation Latency
[table lost in capture: latency with vs. without `--aggressive-offload`]
Multi-Batch Stability
[table lost in capture: one configuration crashes with `Cannot copy out of meta tensor` in `vae_decode`]
Memory Behaviour
[table lost in capture: memory with vs. without `--aggressive-offload`]
Changes
- comfy/cli_args.py: `--aggressive-offload` flag
- comfy/model_management.py: `EXTRA_RESERVED_VRAM` 4 GB gated by flag
- comfy/samplers.py: `empty_cache()` per step
- comfy_execution/caching.py: `BasicCache.clear_all()` public API
- comfy_execution/caching.py: `NullCache.clear_all()` no-op
- comfy_execution/caching.py: `LRUCache`/`RAMPressureCache` metadata reset
- execution.py
Testing
Unit tests cover `clear_all()` for all cache variants (Basic, LRU, RAMPressure, Null), lifecycle callbacks, the meta-device threshold, MPS flush conditionality, and CLI flag wiring.
Reproduction
Queue 4+ identical txt2img jobs at 896×1152, 20 steps, FLUX.2 Dev Q5_K_M.
Observe: time per image and whether batch completes without crash.
Conclusion
The `--aggressive-offload` flag resolves two critical issues on Apple Silicon:
1. MPS allocator fragmentation, via per-step cache flushes.
2. `Cannot copy out of meta tensor` crashes, by using a size-based threshold (>1 GB) for aggressive model destruction, preserving small models like the VAE that the execution cache depends on.

The patch is opt-in (`--aggressive-offload`), has zero impact on non-MPS platforms, and is gated behind `VRAMState.SHARED` checks throughout.