Threaded Loader performance fixes / improvements (+ Aimdo 0.4.6) #14116

Draft
rattus128 wants to merge 11 commits into Comfy-Org:master from rattus128:prs/aimdo-046-threaded-loader-2

Conversation

@rattus128
Contributor

@rattus128 rattus128 commented May 26, 2026

A handful of RAM optimizations, particularly for Windows with slow disks.

Dismantle the stream-pin-buffer. Instead, aimdo 0.4.6 has a direct file -> VRAM load API that uses the same threaded load but with a static ring buffer that matches the chunk size and does the coalescing in C. This saves a lot of RAM and also avoids the prefault delay of a larger stream-pin-buffer allocation, while skirting the giant-weight issue WRT RAM.
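As a rough illustration of the ring-buffer idea (not aimdo's actual API, which lives in C and is not shown in this PR), the sketch below streams a file in fixed-size chunks through a small static ring of reusable buffers. A reader thread fills slots while the consumer drains them. The names `ring_copy`, `CHUNK`, and `SLOTS` are hypothetical, and the `sink` callback merely stands in for the copy to VRAM.

```python
import queue
import threading

CHUNK = 64 * 1024   # hypothetical chunk size; each ring slot matches it
SLOTS = 4           # hypothetical ring depth

def ring_copy(path, sink, chunk=CHUNK, slots=SLOTS):
    """Stream `path` to `sink` through a fixed ring of reusable buffers."""
    ring = [bytearray(chunk) for _ in range(slots)]  # allocated once, reused
    free = queue.Queue()    # slot indices the reader may fill
    ready = queue.Queue()   # (slot index, valid byte count) for the consumer
    for i in range(slots):
        free.put(i)

    def reader():
        try:
            with open(path, "rb", buffering=0) as f:
                while True:
                    i = free.get()           # wait for a reusable slot
                    n = f.readinto(ring[i])  # may be a short read
                    if not n:
                        break                # end of file
                    ready.put((i, n))
        finally:
            ready.put(None)  # always wake the consumer, even on failure

    worker = threading.Thread(target=reader, daemon=True)
    worker.start()
    total = 0
    while (item := ready.get()) is not None:
        i, n = item
        sink(memoryview(ring[i])[:n])  # stand-in for the host -> VRAM copy
        total += n
        free.put(i)  # the slot may be overwritten only after the sink is done
    worker.join()
    return total
```

The point of the fixed ring is that memory use is bounded by `slots * chunk` regardless of file size, which is the property the description attributes to dropping the larger stream-pin-buffer.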

From there, change the pin allocation and movement strategy to always max out pin allocation on the current model even if there isn't enough reservation quota. Instead, move pins on the fly (taking the CUDA sync hit), as that is preferable to risking a disk hit or having to do a RAM deep copy. The MRU 2GB chunk gets evicted repeatedly and rotated through the shortfall to avoid LRU all-weights eviction as the transformer cycles everything.
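The reasoning behind rotating the most recently used chunk can be pictured with a toy cache simulation. This is a minimal sketch under simplifying assumptions (equal-sized chunks, a strictly cyclic access pattern, and a hypothetical `simulate` helper); it is not the PR's pin manager. When a loop touches slightly more chunks than fit, LRU eviction misses on every access, while evicting the most recently used entry keeps most of the set resident.

```python
from collections import OrderedDict

def simulate(policy, capacity, n_chunks, passes):
    """Count cache hits for a cyclic scan over `n_chunks` equal-sized chunks."""
    cache = OrderedDict()  # ordered from oldest-used to newest-used
    hits = 0
    for _ in range(passes):
        for chunk in range(n_chunks):
            if chunk in cache:
                hits += 1
                cache.move_to_end(chunk)  # mark as most recently used
                continue
            if len(cache) >= capacity:
                # LRU drops the oldest entry, MRU drops the newest one.
                cache.popitem(last=(policy == "mru"))
            cache[chunk] = True
    return hits

if __name__ == "__main__":
    for policy in ("lru", "mru"):
        print(policy, simulate(policy, capacity=8, n_chunks=10, passes=5))
```

With room for 8 of 10 chunks, the LRU run scores no hits at all, because each chunk is evicted just before the scan comes back to it. The MRU run only churns the shortfall.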

De-committing memory for the sake of pin buffer freeing is made lightly asynchronous to get this out of the CPU main thread critical path.
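A generic way to take a slow release off the calling thread is to hand it to a background worker. The class below is only a sketch of that pattern; `DeferredFreer` and its methods are hypothetical names, and the `release` callable stands in for whatever de-commit call is actually used.

```python
import queue
import threading
import traceback

class DeferredFreer:
    """Run release callbacks on a background thread instead of the caller's."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                if job is None:   # shutdown sentinel
                    return
                job()             # the potentially slow release happens here
            except Exception:
                traceback.print_exc()  # keep the worker alive on failure
            finally:
                self._jobs.task_done()

    def free_later(self, release):
        """Queue `release` (a zero-argument callable) and return immediately."""
        self._jobs.put(release)

    def drain(self):
        """Block until every queued release has run."""
        self._jobs.join()

    def close(self):
        """Stop the worker after it has processed everything queued so far."""
        self._jobs.put(None)
        self._worker.join()
```

The caller pays only for a queue insertion; `drain()` gives a synchronization point for cases where the memory must actually be gone before continuing.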

Pinned memory is improved with an offload balancer algorithm. A max scatter algorithm is used to spread out the weights that miss out on getting loaded to RAM, so disk bandwidth can be maximized by evening out the load.
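One simple way to picture a "max scatter" selection is to pick the offloaded weights at evenly spaced positions in the execution order rather than in one contiguous run. The helper below is a hypothetical sketch of that idea; the PR's actual balancer is not shown here and may differ, for example in how it groups pins by size.

```python
def max_scatter(n_items, n_selected):
    """Return `n_selected` indices spread as evenly as possible over range(n_items)."""
    if not 0 <= n_selected <= n_items:
        raise ValueError("n_selected must be between 0 and n_items")
    # Centre each pick inside its own equal-width bucket.
    return [(2 * k + 1) * n_items // (2 * n_selected) for k in range(n_selected)]

if __name__ == "__main__":
    # 12 weights in execution order, 3 of which must stay on disk.
    print(max_scatter(12, 3))
```

For 12 weights with 3 left on disk this picks positions 2, 6 and 10, so a prefetcher sees one disk read every four weights instead of three in a row.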

Aimdo 0.4.7 improves VRAM load patterns by not loading past the VRAM usage when accounting for all yet-to-be-loaded pages. This avoids a disk revisit for these weights.

Finally, fix the file open mode on Windows and unify it with the aimdo open, which makes disks just a little faster on Windows.

Example test conditions:

Windows, RTX5060, 32GB RAM, PCIE x4 Gen1 (downgraded)
LTX2.3 960x540x10s

scr

Before:

[INFO] got prompt
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load LTXAVTEModel_
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
[INFO] Found quantization metadata version 1
[INFO] Detected mixed precision quantization
[INFO] Using mixed precision operations
[INFO] Native ops: nvfp4, int8_blockwise, float8_e4m3fn_rowwise, mxfp8, hybrid_mxfp8, float8_e5m2, float8_e4m3fn_blockwise, float8_e4m3fn, int8_tensorwise
[INFO] model weight dtype torch.bfloat16, manual cast: torch.bfloat16
[INFO] model_type FLUX
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[WARNING] no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
[INFO] 0 models unloaded.
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:44<00:00, 28.03s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [02:01<00:00, 40.46s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 463.97 seconds
scr

After:

[INFO] got prompt
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load LTXAVTEModel_
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
[INFO] Found quantization metadata version 1
[INFO] Detected mixed precision quantization
[INFO] Using mixed precision operations
[INFO] Native ops: float8_e4m3fn_rowwise, float8_e4m3fn_blockwise, nvfp4, int8_tensorwise, int8_blockwise, mxfp8, float8_e5m2, float8_e4m3fn, hybrid_mxfp8
[INFO] model weight dtype torch.bfloat16, manual cast: torch.bfloat16
[INFO] model_type FLUX
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[WARNING] no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
[INFO] 0 models unloaded.
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:43<00:00, 12.94s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:04<00:00, 21.59s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 277.78 seconds
scr

v0.22.0:

Model LTXAV prepared for dynamic VRAM loading. 23838MB Staged. 1660 patches attached. Force pre-loaded 1496 weights: 44 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:53<00:00, 14.16s/it]
Model LTXAV prepared for dynamic VRAM loading. 23838MB Staged. 1660 patches attached. Force pre-loaded 1496 weights: 44 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:05<00:00, 21.95s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 361.52 seconds

After + #13971:

[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:09<00:00,  8.71s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:46<00:00, 15.65s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 239.95 seconds
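For a quick side-by-side, the end-to-end prompt times reported in the four logs above can be compared directly. The snippet only does arithmetic on those reported numbers; the dictionary labels are shorthand for the log headings, and `reduction_pct` is a hypothetical helper.

```python
# "Prompt executed in ..." values copied from the logs above, in seconds.
runs = {
    "Before": 463.97,
    "After": 277.78,
    "v0.22.0": 361.52,
    "After + #13971": 239.95,
}

def reduction_pct(baseline, value):
    """Percentage reduction in wall time relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

if __name__ == "__main__":
    for label, seconds in runs.items():
        pct = reduction_pct(runs["Before"], seconds)
        print(f"{label:>15}: {seconds:7.2f} s ({pct:5.1f}% less than Before)")
```

On these figures, "After" is roughly 40% faster end to end than "Before", and "After + #13971" roughly 48%. These are single runs on one machine, so the figures are indicative only.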

@socket-security

socket-security Bot commented May 26, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | comfy-aimdo@0.4.5 ⏵ 0.4.7 | 99 (+1) | 100 | 100 | 100 | 70 |

View full report

@Ph0rk0z

Ph0rk0z commented May 26, 2026

Whole thing is still net negative for me.

#13802 (comment)

v22 dynamic vram disabled


INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
Requested to load Flux2
loaded completely; 14564.04 MB usable, 8996.02 MB loaded, full load: True
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.86s/it]
Requested to load TAESD
loaded completely; 4994.88 MB usable, 10.21 MB loaded, full load: True
Prompt executed in 33.81 seconds << compile (auto, no node)
got prompt
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
Prompt executed in 6.11 seconds << reroll

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.46s/it]
Prompt executed in 7.19 seconds << new prompt
got prompt

master dynamic vram enabled


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.20 seconds << (compile)
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.27 seconds << (slower reroll)
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.38 seconds << (owo, what's this?)

PR

[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 31.16 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.22 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.90 seconds

PR --cache-ram 1


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.03 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.24 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.91 seconds

@alldtes9-tech

alldtes9-tech commented May 27, 2026

> Whole thing is still net negative for me.


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.03 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.24 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.91 seconds

You're using INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work.

Also, you're using GGUF for the text encoder, which I don't think supports Dynamic vram yet. rattus has a draft PR on the ComfyUI-GGUF repo, but I don't know if it's a complete implementation yet since it's still marked as draft.

Why not compare performance using quants that are natively supported in Comfy instead?

Edit:

Here are my results using quants that are supported in Comfy, using zimage BF16 + Qwen3 4B BF16.

Master with --disable-dynamic-vram args.

[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5677.80 MB usable, 5437.25 MB loaded, 2235.00 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] 0 models unloaded.
[INFO] Unloaded partially: 277.87 MB freed, 5159.38 MB remains loaded, 237.50 MB buffer reserved, lowvram patches: 0
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5621.67 MB usable, 5245.48 MB loaded, 6494.06 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
  0%|                                                                                            | 0/4 [00:00<?, ?it/s][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
 25%|█████████████████████                                                               | 1/4 [00:04<00:12,  4.03s/it][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
 50%|██████████████████████████████████████████                                          | 2/4 [00:05<00:05,  2.64s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00,  2.33s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4710.57 MB usable, 4334.25 MB loaded, 7405.29 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 112
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00,  4.53s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 74.20 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.78s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4733.45 MB usable, 4358.45 MB loaded, 7381.09 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00,  4.56s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 50.51 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5612.68 MB usable, 5374.75 MB loaded, 2297.50 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.80s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4735.93 MB usable, 4360.93 MB loaded, 7378.61 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:35<00:00,  4.46s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 53.89 seconds <<<< 3rd run (change prompt)

master with dynamic vram enabled (default)

[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] 0 models unloaded.
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
  0%|                                                                | 0/4 [00:00<?, ?it/s,   Model Initializing ...  ][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
 25%|████████████▎                                    | 1/4 [00:07<00:23,  7.73s/it,  Model Initialization complete!  ][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
 50%|██████████████████████████████████████████                                          | 2/4 [00:02<00:02,  1.20s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.14s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:23<00:00,  3.00s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 42.62 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.10s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00,  2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 32.90 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.12s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00,  2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 35.02 seconds <<<< 3rd run (change prompt)

@Ph0rk0z

Ph0rk0z commented May 27, 2026

> You're using INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work.

Because this used to work for me. I am comparing the thing I actually want to use, not something theoretical. One could ask "why not compare the results on a system with an 8gb GPU and 16gb of ram?", "why not compare SDXL?", etc.

> Master with --disable-dynamic-vram args.

Can't do that anymore because on master dynamic vram can no longer be properly disabled. The last PR that caused this disk-reloading behavior functionally deprecated this option. I can give you a log of .22 vs master if you'd like. You cannot induce a regression and then say that your "fix" makes it better.

@alldtes9-tech

> Because this used to work for me. I am comparing the thing I actually want to use, not something theoretical. One could ask "why not compare the results on a system with an 8gb GPU and 16gb of ram?", "why not compare SDXL?", etc.

I'm not asking you to run a different model or use a different GPU or memory setup. I was asking why not compare using quants that are natively supported in Comfy.

If Comfy makes changes that improve performance for the native path and those changes end up affecting performance in your custom setup, I don't think it's fair to immediately conclude that Comfy introduced a performance regression just because a custom integration becomes slower.

Since you're using gguf/INT8 through a custom node / Comfy Kitchen fork, it might also be worth asking those maintainers whether latest master introduced changes they need to adapt to.

Personally, I mostly use models and quants that work through native Comfy paths, and I've generally seen benefits from recent changes.

That said, if the slowdown also happens on native setups, then that's a different discussion.

@Ph0rk0z

Ph0rk0z commented May 27, 2026

There is literally no scenario where loading from disk will help me because my disks are slow and I have plenty of ram just for this reason.

> I was asking why not compare using quants that are natively supported in Comfy

Because those quants don't work for me. If they had, I wouldn't have sought different ones. I am having to re-invent the wheel and most likely fix those things myself, they are not the only broken nodes from this change. Prior to #13802 I was able to turn it off for my use, and you were able to leave it on for your use. We could both have our cake.

@rattus128 rattus128 force-pushed the prs/aimdo-046-threaded-loader-2 branch from 1eeb963 to bf0ac49 Compare May 28, 2026 09:19
@silveroxides
Contributor

@Ph0rk0z Does my QuantOps no longer work for you?

@Ph0rk0z

Ph0rk0z commented May 28, 2026

I don't think it's the node because I have similar behavior with int8 TE and gguf TE. Both quantops and bobjohnson. I even went and tried the PR to comfy-gguf and there is still lost speed.

At this point it's impossible to post-compile int8 and I think GGUF because it will re-compile every run. I have to fire up ye old nunchaku and fp8 models too to see what's going on there. Have been using klein like in the logs for last 2 months, assuming things will keep working.

@brendanhoar
Contributor

brendanhoar commented May 29, 2026

Lenovo Thinkstation P620. 512GB RAM. 64-core/128-thread Threadripper Pro 3995WX. 4090 24GB at PCIe 4.0. Windows 11, missing the most recent major service pack (as it does not play happy with the Thinkstation). pytorch cu128, I think. NVidia driver version: 591.86. ComfyUI up to date with master here, with and without this PR.

While various Flux.1/Chroma models (e.g. ~17GB) load out of OS cache almost instantaneously...LTX-dev (~40GB+) does not.

I'm using another program to read the entire LTX-2.3-22b-dev.safetensors model, which as a side effect loads the entire model into the OS cache. I've done this test with the file hosted on various drives (NVMe, HDD, encrypted, non-encrypted, etc.).

I think the biggest issue I see is the hiccups in the 2nd image, which end up extending the prep time from first file access to core inference to 5-40 minutes on each generation.

Due to prewarming the data into OS cache, there's literally no storage hardware IO to the LTX model file (outside of the 512GB of RAM) during the circled portion below:

23:47:xx to 23:56:xx before it starts moving again, all RAM (yes, IO, but OS memory cache, not storage hardware) to GPU transfers:
image

Here are some example hiccups, captured with Process Monitor. It's reading all that data out of the OS-controlled RAM cache during the file read calls below:

Without patch:
image

With patch:
image

In the examples above, it's just that one model, I'm not using any LORAs.

Not sure what else I can do to test? Test results about the same with and without this PR (other than the actual type of low level call being different with and without patch, if you look closely above), assuming the following means I have tested the PR:

git switch - (or sometimes just git switch master, since - works and then doesn't?)
(tells me I am on master)
git fetch
git pull
git switch pr/14116
(tells me: Switched to branch 'pr/14116')
Restart SwarmUI/ComfyUI.

@brendanhoar
Contributor

brendanhoar commented May 29, 2026

Setting back to the last commit into the 0.2.1 release (git checkout 26515ac) and setting --disable-smart-memory, the OS cache reads from RAM into the GPU are in much larger chunks (alternating 132MB/8KB reads). This means about 1 minute to start the main inference cycle, instead of 5-40 minutes.

So, I'm going to leave it there for now.

image

@Ph0rk0z

Ph0rk0z commented May 29, 2026

Pretty much for us it will never page out. My page file is 2gb, only there for an emergency. Since you are on v21 what speeds do you get with --disable-dynamic-vram? v21/v22 are the last versions it wasn't hobbled. I'm curious how it was on windows.

this patch seems to undo some of the undesirable behavior: #14162

rattus128 added 11 commits May 30, 2026 06:19
Make destination optional (or make it optionally GPU) and use aimdo
to file_read direct to GPU.
This consumed too much RAM and it's better to just take the hit on
the CPU syncing back the stream on a short ring buffer. Aimdo
implements this so just rip the stream pin buffer from comfy.
It's better to just let the active model load past the pin limit as
pins and let the pins move around. This saves the HDD and SATA
people disk traffic while only costing a few GPU syncs.
This opens on Windows with more favourable flags
Exclude live loras from the numbers to avoid the case where the reported
loaded memory exceeds the size of the model.

This caused me confusion in the Kijai visualizer when it looked fully
loaded but was hitting disk due to this accounting discrepancy.
useful for max scattering something ordered.
Use a max scatter algorithm to prioritize pins of the same size such
that when doing a little bit of offloading it gets scattered, allowing
the prefetcher to more evenly swallow the offload.
Aimdo 0.4.7 implements VRAM buffer exhaustion prediction to avoid
early speculative load of weights that definitely won't fit once the
inference gets further in.
This could happen mid prefetch block, cause a sync of the entire
block and lose overlap. Get ahead of the problem with a free down
at the natural compute stream sync point.
This is reasonably bad if it starts causing swap pressure, moreso than
during normal ram-cache proceedings. Clamp it.
@rattus128 rattus128 force-pushed the prs/aimdo-046-threaded-loader-2 branch from bf0ac49 to 4367270 Compare May 29, 2026 21:57