Dynamic VRAM support#427

Draft
rattus128 wants to merge 6 commits intocity96:mainfrom
rattus128:dynamic-vram

Conversation

@rattus128
Contributor

@rattus128 rattus128 commented Mar 5, 2026

The new dynamic VRAM system in comfy-core enhances both RAM and VRAM management. Models are no longer offloaded from VRAM to RAM (which has a habit of becoming swap) and are now loadable asynchronously on the sampler's first iteration. This gives a significant speedup to big multi-model workflows on low-resource systems. VRAM offloading is managed on demand, so there is no need for VRAM usage estimates anymore.

The core has already upstreamed several of the resource saving features of GGUF in various forms.

  • The core linear layers are now initialized unallocated, avoiding the naked commit charge for the empty tensor.
  • Models are loaded with assign=True to avoid a deep copy and committed memory on model load (GGUF does something similar via _load_state_dict hooking).
  • The sft file is mmapped read-only to avoid that commit charge; GGUF does this too.

So this implements a QuantizedTensor backend and subclasses the new ModelPatcherDynamic to bring GGUF + dynamic without needing custom ops.

The patcher subclass is needed to hook and unhook LoRAs on-the-fly. Otherwise it's just: load the state dict into the new QuantizedTensor and go.

This brings the full feature-set of the core comfy caster to GGUF, including async offload (and async primary load), pinned memory, and now dynamic management.

There's some boilerplate to implement a downgrade back to ModelPatcher. This is needed for things like torch compile and hooks, where dynamic VRAM support is TBD.

Still drafting; I will post some more performance results. I am going to pull a RAM stick and go for some 16GB-RAM flows with GGUF.

Example Test conditions:

WAN2.2 14B Q8 GGUF, 640x640x81f, RTX5090, Linux, 96GB, 2x runs (disk caches warmed by the first model runs)

Before

Prompt executed in 60.31 seconds
Prompt executed in 55.99 seconds

After

Prompt executed in 48.75 seconds
Prompt executed in 43.35 seconds
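Those numbers work out to roughly a 19-23% end-to-end speedup:

```python
# Run times from the before/after logs above (seconds).
before = [60.31, 55.99]
after = [48.75, 43.35]

for b, a in zip(before, after):
    print(f"{(1 - a / b) * 100:.1f}% faster")  # 19.2% and 22.6%
```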

Vibe code. To be reviewed.
If in dynamic mode, load GGUF as a QT.
Refactor this to support the new reconstructability protocol in the
comfy core. This is needed for DynamicVRAM (to support legacy
demotion for fallbacks). Add the logic for dynamic_vram construction.

This is also needed for worksplit multi-gpu branch where the model
is deep-cloned via reconstruction to put the model on two parallel
GPUs.
Factor this out to a helper and implement the new core reconstruction
protocol. Consider the mmap_released flag 1:1 with the underlying model
such that it moves with the base model in model_override.
@m8rr

m8rr commented Mar 6, 2026

https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram

Is this the same thing?
I used the above, and an error occurs when using CLIPLoader (GGUF) with a GGUF text encoder.


D:\AI\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --disable-api-nodes --output-directory E:\output --temp-directory E:\output
Setting output directory to: E:\output
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Checkpoint files will always be loaded safely.
Total VRAM 12282 MB, total RAM 32085 MB
pytorch version: 2.10.0+cu130
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 SUPER : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 14438.0
working around nvidia conv3d memory bug.
Using pytorch attention
aimdo: src-win/cuda-detour.c:77:INFO:aimdo_setup_hooks: found driver at 00007FFB60C00000, installing 4 hooks
aimdo: src-win/cuda-detour.c:61:DEBUG:install_hook_entrys: hooks successfully installed
aimdo: src/control.c:66:INFO:comfy-aimdo inited for GPU: NVIDIA GeForce RTX 4070 SUPER (VRAM: 12281 MB)
DynamicVRAM support detected and enabled
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.16.3
Setting temp directory to: E:\output\temp
ComfyUI frontend version: 1.39.19
[Prompt Server] web root: D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfyui_frontend_package\static
ComfyUI-GGUF: Allowing full torch compile

Import times for custom nodes:
   0.0 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-KJNodes
   0.1 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.056s (created=0, skipped_existing=81, orphans_pruned=0, total_seen=85)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 
....
....
'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
Model LTXAVTEModel_ prepared for dynamic VRAM loading. 50881MB Staged. 0 patches attached. Force pre-loaded 290 weights: 2995 KB.
!!! Exception during processing !!! shape '[4096, 3840]' is invalid for input of size 12902400
Traceback (most recent call last):
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 524, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 333, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 307, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 295, in process_inputs
    result = f(**inputs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\nodes.py", line 80, in encode
    return (clip.encode_from_tokens_scheduled(tokens), )
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 313, in encode_from_tokens_scheduled
    pooled_dict = self.encode_from_tokens(tokens, return_pooled=return_pooled, return_dict=True)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 377, in encode_from_tokens
    o = self.cond_stage_model.encode_token_weights(tokens)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\lt.py", line 167, in encode_token_weights
    out, pooled, extra = self.gemma3_12b.encode_token_weights(token_weight_pairs)
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 45, in encode_token_weights
    o = self.encode(to_encode)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 306, in encode
    return self(tokens)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 279, in forward
    outputs = self.transformer(None, attention_mask_model, embeds=embeds, num_tokens=num_tokens, intermediate_output=intermediate_output, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32, embeds_info=embeds_info)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 794, in forward
    return self.model(input_ids, *args, **kwargs)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 719, in forward
    x, current_kv = layer(
                    ~~~~~^
        x=x,
        ^^^^
    ...<3 lines>...
        past_key_value=past_kv,
        ^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 605, in forward
    x, present_key_value = self.self_attn(
                           ~~~~~~~~~~~~~~^
        hidden_states=x,
        ^^^^^^^^^^^^^^^^
    ...<4 lines>...
        sliding_window=sliding_window,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 466, in forward
    xq = self.q_proj(hidden_states)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 373, in forward
    return self.forward_comfy_cast_weights(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 365, in forward_comfy_cast_weights
    weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 228, in cast_bias_weight
    return cast_bias_weight_with_vbar(s, dtype, device, bias_dtype, non_blocking, compute_dtype, want_requant)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 148, in cast_bias_weight_with_vbar
    comfy.model_management.cast_to_gathered(xfer_source, pin)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\model_management.py", line 1204, in cast_to_gathered
    dest_views = comfy.memory_management.interpret_gathered_like(tensors, r)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\memory_management.py", line 71, in interpret_gathered_like
    actuals[attr] = gathered[offset:offset+size].view(dtype=template.dtype).view(template.shape)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
RuntimeError: shape '[4096, 3840]' is invalid for input of size 12902400
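For what it's worth, the sizes in the error line up with the quantized layout: 4096 × 3840 = 15,728,640 elements, and at Q6_K's 210 bytes per 256-weight block that is exactly 12,902,400 bytes. So the gathered slice presumably still holds Q6_K bytes where the caster expects dequantized elements (a guess from the arithmetic alone):

```python
import torch

expected_elems = 4096 * 3840             # element count the template shape wants
q6k_bytes = expected_elems // 256 * 210  # Q6_K: 210 bytes per 256-weight block
assert q6k_bytes == 12902400             # the size reported in the error

# view() cannot reinterpret a buffer into a shape with a different element
# count, which reproduces the failure mode in the traceback.
try:
    torch.empty(q6k_bytes).view(4096, 3840)
except RuntimeError as e:
    print(e)  # shape '[4096, 3840]' is invalid for input of size 12902400
```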

@m8rr

m8rr commented Mar 15, 2026

This version definitely has a speed boost. However, if you're getting errors with the GGUF text encoder like me, try modifying the code as follows. Only the text encoder operates the old way; it should serve as a good temporary workaround until the update.

nodes.py, around line 206 (change False to True):

def _load_gguf_clip_patcher(clip_paths, clip_type, disable_dynamic=True):
    return _load_gguf_clip(clip_paths, clip_type, disable_dynamic=disable_dynamic).patcher

def _load_gguf_clip(clip_paths, clip_type, disable_dynamic=True):

@kingp0dd

This version definitely has a speed boost. However, if you're getting errors with the GGUF text encoder like me, try modifying the code as follows. Only the text encoder is operating the old way. it should serve as a good temporary workaround until the update.

nodes.py line 206~ (False->True)

def _load_gguf_clip_patcher(clip_paths, clip_type, disable_dynamic=True):
    return _load_gguf_clip(clip_paths, clip_type, disable_dynamic=disable_dynamic).patcher

def _load_gguf_clip(clip_paths, clip_type, disable_dynamic=True):

That means it's already working. How much did it save you?

@m8rr

m8rr commented Mar 17, 2026

without --disable-dynamic-vram

Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:29<00:00,  5.93s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.72s/it]
Requested to load AudioVAE
loaded completely; 1968.11 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 164.28 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:24<00:00,  4.81s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.69s/it]
Requested to load AudioVAE
loaded completely; 1934.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 90.59 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:23<00:00,  4.63s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.68s/it]
Requested to load AudioVAE
loaded completely; 1966.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 77.65 seconds

with --disable-dynamic-vram

Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
loaded partially; 9564.67 MB usable, 9525.25 MB loaded, 7689.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:30<00:00,  6.09s/it]
Unloaded partially: 620.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1287.84 MB freed, 7616.93 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.33s/it]
Requested to load AudioVAE
loaded completely; 2233.75 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 193.22 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.43s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:42<00:00, 14.15s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 97.68 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.25s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:41<00:00, 13.87s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 95.82 seconds

@kingp0dd

kingp0dd commented Mar 17, 2026 via email

@m8rr

m8rr commented Mar 17, 2026

Check if the quant_ops.py file exists inside the ComfyUI-GGUF folder. If it’s not there, the installation wasn't done correctly.

I installed like this.
git clone -b dynamic-vram https://github.com/rattus128/ComfyUI-GGUF

@kingp0dd

kingp0dd commented Mar 27, 2026


That's weird, i did exactly that but it's still not activating.

I have that file:

~/comfy/ComfyUI/custom_nodes/ComfyUI-GGUF$ ls
dequant.py   LICENSE    nodes.py  __pycache__     quant_ops.py  requirements.txt
__init__.py  loader.py  ops.py    pyproject.toml  README.md     tools
~/comfy/ComfyUI/custom_nodes/ComfyUI-GGUF$ git status
On branch dynamic-vram
Your branch is up to date with 'origin/dynamic-vram'.

nothing to commit, working tree clean

But when loading the GGUF Q4KM Wan2.2, I still can't see the dynamic loading log:

Requested to load WAN21
loaded partially; 5395.22 MB usable, 5292.63 MB loaded, 4044.55 MB offloaded, 102.59 MB buffer reserved, lowvram patches: 0
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True

I know dynamic loading is enabled in ComfyUI because other models have that log:

Unloaded partially: 3616.85 MB freed, 1431.91 MB remains loaded, 276.90 MB buffer reserved, lowvram patches: 276
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.

I tried this both in v0.16.4 and the latest Comfy v0.18.2.

EDIT:

Sorry to bother. It works now after nuking my whole comfyui installation.

@kingp0dd

I ran a few tests with Wan2.2 Q4KM GGUF. It seems that GGUF dynamic VRAM is slower than non-dynamic. These are all third or fourth runs. I also noticed that the CLIP model (a Q5KM GGUF) is consistently not loaded from cache, no matter how many consecutive runs I try. The RTXVideoSuperResolution node also finishes slower (100s) versus only 2s with non-dynamic VRAM.

Dynamic VRAM:

Requested to load WanTEModel
loaded completely; 5674.75 MB usable, 5129.20 MB loaded, full load: True
-----------------#29:1044 [CLIPTextEncode]: 14.30s - vram 5894787076b
0 models unloaded.
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 17.00s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:961 [KSamplerAdvanced]: 64.11s - vram 20b

Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:34<00:00, 17.05s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:968 [KSamplerAdvanced]: 45.32s - vram 3356487796b


Requested to load WanTEModel


loaded completely; 5672.75 MB usable, 5129.20 MB loaded, full load: True
-----------------#29:1044 [CLIPTextEncode]: 22.36s - vram 5894787076b
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
-----------------#29:960 [WanImageToVideo]: 7.20s - vram 2374470160b
-----------------#29:1076 [CFGZeroStar]: 0.00s - vram 0b
-----------------#29:1276 [PrimitiveInt]: 0.00s - vram 0b
Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.47s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:961 [KSamplerAdvanced]: 74.66s - vram 20b
Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
Requested to load WAN21
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.67s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:968 [KSamplerAdvanced]: 61.17s - vram 3358965876b

Without Dynamic VRAM:

Requested to load WAN21
loaded partially; 5661.19 MB usable, 5558.60 MB loaded, 3778.57 MB offloaded, 102.59 MB buffer reserved, lowvram patches: 0
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.85s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:961 [KSamplerAdvanced]: 39.28s - vram 2035224048b
Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
Requested to load WAN21
loaded partially; 5661.19 MB usable, 5558.60 MB loaded, 3778.57 MB offloaded, 102.59 MB buffer reserved, lowvram patches: 0
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.87s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:968 [KSamplerAdvanced]: 38.92s - vram 3326974068b
-----------------#1323 [PrimitiveInt]: 0.00s - vram 0b
-----------------#1321 [SomethingToString]: 0.00s - vram 0b
-----------------#1320 [XDateTimeString]: 0.00s - vram 0b
-----------------#1317 [StringFunction|pysssss]: 0.00s - vram 0b
-----------------#1324 [PreviewAny]: 0.00s - vram 0b
-----------------#29:1307 [XLatentSave]: 0.01s - vram 0b
-----------------#29:1285:1280 [easy ifElse]: 0.00s - vram 0b
Requested to load WanVAE
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
-----------------#29:963 [VAEDecode]: 10.03s - vram 3296513680b
-----------------#29:1263 [Context (rgthree)]: 0.00s - vram 0b
-----------------#29:1265 [Context Switch (rgthree)]: 0.00s - vram 0b
-----------------#29:1285:1280 [easy ifElse]: 0.00s - vram 0b
-----------------#29:1284:1099 [easy ifElse]: 0.00s - vram 0b
-----------------#1340 [RTXVideoSuperResolution]: 2.24s - vram 357212160b
-----------------#28 [VHS_VideoCombine]: 3.59s - vram 0b
-----------------#9 [VHS_PruneOutputs]: 0.00s - vram 0b
Prompt executed in 108.08 seconds

@m8rr

m8rr commented Mar 27, 2026

I ran Wan2.2 three times. Unlike LTX2.3, there wasn't a huge difference, but it doesn't seem any slower either.

Total VRAM 12282 MB, total RAM 32085 MB
pytorch version: 2.11.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 SUPER : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 14438.0
Using sage attention
aimdo: src-win/cuda-detour.c:77:INFO:aimdo_setup_hooks: found driver at 00007FFD38C80000, installing 4 hooks
aimdo: src-win/cuda-detour.c:61:DEBUG:install_hook_entrys: hooks successfully installed
aimdo: src/control.c:69:INFO:comfy-aimdo inited for GPU: NVIDIA GeForce RTX 4070 SUPER (VRAM: 12281 MB)
DynamicVRAM support detected and enabled
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.18.2
comfy-aimdo version: 0.2.12
comfy-kitchen version: 0.2.8
ComfyUI frontend version: 1.43.7
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.09s/it]
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:39<00:00, 49.68s/it]
Requested to load WanVAE
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
Prompt executed in 183.19 seconds
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.07s/it]
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:38<00:00, 49.45s/it]
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
Prompt executed in 156.86 seconds
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.11s/it]
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:38<00:00, 49.48s/it]
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
Prompt executed in 140.45 seconds
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
loaded partially; 8181.12 MB usable, 8103.49 MB loaded, 370.42 MB offloaded, 96.28 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.26s/it]
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 8473.91 MB offloaded, 527.89 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:41<00:00, 50.55s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 205.24 seconds
Requested to load WAN21
loaded partially; 8170.12 MB usable, 8089.41 MB loaded, 384.49 MB offloaded, 96.28 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.80s/it]
Requested to load WAN21
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 8473.91 MB offloaded, 527.89 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:40<00:00, 50.42s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 155.54 seconds
Requested to load WAN21
loaded partially; 8170.12 MB usable, 8089.41 MB loaded, 384.49 MB offloaded, 96.28 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.61s/it]
Requested to load WAN21
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 8473.91 MB offloaded, 527.89 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:40<00:00, 50.44s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 144.12 seconds
