Reintroduce @pollockjj's tiled-VAE and UPSCALE_MODEL MultiGPU lanes by Kosinkadink · Pull Request #14067 · Comfy-Org/ComfyUI

Kosinkadink · 2026-05-22T23:45:14Z

Reapplies the two @pollockjj commits that were temporarily reverted by #14066 so they could be separated from #14052 (threaded-loader fix).

Cherry-picked in original order:

74b0a826 — Add UPSCALE_MODEL lane to MultiGPU CFG Split
4d3d68e4 — Add tiled VAE lane to MultiGPU Work Units

Authorship preserved (John Pollock <pollockjj@gmail.com>).

Companion to #14066.

…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time

…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)

load_checkpoint_guess_config_clip_only() calls load_checkpoint_guess_config() with output_model=False, leaving out[0] as None. The subsequent unconditional assignment of cached_patcher_init crashed with AttributeError, breaking CLIP-only checkpoint loading entirely. Guard the assignment with a None check. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
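
A minimal sketch of the guard described above. `FakePatcher` and `load_checkpoint_sketch` are hypothetical stand-ins, not ComfyUI's loader; only the fact that `out[0]` can be `None` when `output_model=False` is taken from the commit message.

```python
from typing import Callable, Optional, Tuple


class FakePatcher:
    """Hypothetical stand-in for a model patcher object."""
    cached_patcher_init: Optional[Tuple[Callable, tuple]] = None


def load_checkpoint_sketch(path: str, output_model: bool = True):
    """Return (model_patcher, clip); model_patcher is None when output_model=False."""
    model_patcher = FakePatcher() if output_model else None
    clip = object()  # placeholder for the CLIP object
    out = (model_patcher, clip)

    # Guard: only record the init info when a model was actually produced.
    if out[0] is not None:
        out[0].cached_patcher_init = (load_checkpoint_sketch, (path,))
    return out


clip_only = load_checkpoint_sketch("model.safetensors", output_model=False)
full = load_checkpoint_sketch("model.safetensors")
```

Without the `is not None` check, the CLIP-only call would attempt an attribute assignment on `None` and raise `AttributeError`.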

torch.device(i) defaults to CUDA, so XPU/NPU branches were producing 'cuda:N' devices that don't match get_torch_device() output ('xpu:N'/'npu:N'). This caused devices.remove(get_torch_device()) to raise ValueError when exclude_current=True on non-NVIDIA hardware. Use explicit device strings, and guard the remove() with a membership check for safety. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
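
The two changes can be illustrated without PyTorch. In this sketch plain strings stand in for `torch.device` objects, and `list_devices` / `other_devices` are hypothetical helper names, not functions from the branch.

```python
def list_devices(backend: str, count: int) -> list[str]:
    """Build explicit '<backend>:<index>' names instead of relying on a default backend."""
    return [f"{backend}:{i}" for i in range(count)]


def other_devices(backend: str, count: int, current: str) -> list[str]:
    """Return every device except the current one, tolerating a mismatch."""
    devices = list_devices(backend, count)
    # Membership check: list.remove() raises ValueError if the item is absent.
    if current in devices:
        devices.remove(current)
    return devices
```

With matching names, `other_devices("xpu", 2, "xpu:0")` drops the current device. With a mismatched name such as `"cuda:0"`, the list is returned unchanged instead of raising.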

create_multigpu_deepclones cloned the existing 'multigpu' additional_models list verbatim and never pruned entries beyond limit_extra_devices. If a workflow was previously prepared for more GPUs, reducing max_gpus would leave stale clones attached and eligible for later scheduling. Replace the TODO block with a real prune that keeps only clones whose load_device is either the model's load_device or in limit_extra_devices, and re-match clones if anything was removed. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
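
The prune rule can be sketched as a simple filter. `Clone` and `prune_clones` are hypothetical names used for illustration; the real code operates on ModelPatcher clones, which this sketch reduces to a `load_device` field.

```python
from dataclasses import dataclass


@dataclass
class Clone:
    """Hypothetical stand-in for a per-device model clone."""
    load_device: str


def prune_clones(clones, primary_device, limit_extra_devices):
    """Keep clones on the primary device or on an allowed extra device.

    Returns (kept, removed_any) so the caller can decide whether to re-match.
    """
    allowed = {primary_device, *limit_extra_devices}
    kept = [c for c in clones if c.load_device in allowed]
    return kept, len(kept) != len(clones)


stale = [Clone("cuda:1"), Clone("cuda:2"), Clone("cuda:3")]
kept, removed_any = prune_clones(stale, "cuda:0", ["cuda:1"])
```

Here a workflow prepared for four GPUs is reduced to two, so the clones on `cuda:2` and `cuda:3` are dropped and `removed_any` signals that a re-match is needed.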

…path) The multigpu cond-batching loop called model.memory_required(input_shape) without conditioning shapes, while the single-GPU path at line 279 passes cond_shapes. Large conditioning tensors (e.g. video prompts, control inputs) were therefore under-counted, risking OOM at runtime when the chosen batch size was too large. Match the single-GPU pattern by building cond_shapes from each batched cond's conditioning dict and passing it to memory_required. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
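
To see why omitting conditioning shapes under-counts memory, here is a toy sketch. Both functions are hypothetical: conditioning dicts are reduced to key-to-shape mappings, and `memory_required` is a simple element count rather than ComfyUI's real estimator.

```python
from collections import defaultdict
from math import prod


def collect_cond_shapes(batched_conds):
    """Group the shape of every conditioning entry by key."""
    cond_shapes = defaultdict(list)
    for conditioning in batched_conds:
        for key, shape in conditioning.items():
            cond_shapes[key].append(tuple(shape))
    return dict(cond_shapes)


def memory_required(input_shape, cond_shapes=None):
    """Toy estimate: element count of the input plus all conditioning tensors."""
    total = prod(input_shape)
    for shapes in (cond_shapes or {}).values():
        total += sum(prod(s) for s in shapes)
    return total


conds = [{"c_crossattn": (1, 77, 768)}, {"c_crossattn": (1, 77, 768)}]
without_conds = memory_required((2, 4, 64, 64))
with_conds = memory_required((2, 4, 64, 64), collect_cond_shapes(conds))
```

In this toy case the estimate that ignores conditioning is several times smaller than the one that includes it, which is the kind of gap that can let an oversized batch through.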

…ed gap
Two doc-only changes addressing minor CodeRabbit findings on PR #7063:
* cli_args.py: clarify --cuda-device help text to document the required comma-separated format ('0' or '0,1'), matching how the value is consumed by CUDA_VISIBLE_DEVICES in main.py.
* nodes_multigpu.py: add a docstring NOTE on the (currently unregistered) MultiGPUOptionsNode explaining that its relative_speed input is plumbed through to model_options['multigpu_options'] but is not yet consulted by the cond scheduler, which still uses uniform round-robin via next_available_device(). Wire relative_speed into the scheduler before re-enabling the node.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>

Drop the new ignore_multigpu positional argument from prepare_state and from the ON_PREPARE_STATE callbacks; pass the flag via model_options instead. This restores the original 3-arg callback signature so existing custom-node ON_PREPARE_STATE handlers keep working unchanged, while still letting prepare_state's recursive call into multigpu_clones short-circuit. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
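
A rough sketch of the pattern, assuming a three-argument callback of the form `(device, model_options, noise_shape)`. That argument list, the `multigpu_devices` key, and `prepare_state_sketch` itself are illustrative assumptions; only the idea of carrying `ignore_multigpu` inside `model_options` comes from the commit message.

```python
def prepare_state_sketch(callbacks, device, model_options, noise_shape):
    """Invoke three-argument callbacks; extra flags travel inside model_options."""
    for callback in callbacks:
        callback(device, model_options, noise_shape)

    # The flag short-circuits the recursive call made for each clone.
    if model_options.get("ignore_multigpu", False):
        return

    for clone_device in model_options.get("multigpu_devices", []):
        clone_options = {**model_options, "ignore_multigpu": True}
        prepare_state_sketch(callbacks, clone_device, clone_options, noise_shape)


seen = []
prepare_state_sketch(
    [lambda device, options, shape: seen.append(device)],
    "cuda:0",
    {"multigpu_devices": ["cuda:1", "cuda:2"]},
    (1, 4, 64, 64),
)
```

Existing handlers written against the three-argument form keep working because nothing was added to the call signature; handlers that care about the flag can read it from the options dict.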

QwenFunControlNet.pre_run stashes the model's diffusion_model into self.extra_args['base_model'], but ControlBase.cleanup never clears extra_args. The diffusion_model reference therefore lingered between sampling runs, blocking ComfyUI's model offload/eviction logic from freeing the UNet and -- for multigpu -- holding one such reference per per-device control clone (defeating the max_gpus pruning added in this PR). Override cleanup to drop the entry; super().cleanup() already recurses into multigpu_clones so each per-device clone pops its own. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
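
The cleanup override can be sketched with two small classes. `ControlBaseSketch` and `FunControlSketch` are hypothetical reductions of the real classes; the recursion into `multigpu_clones` mirrors what the commit message says `super().cleanup()` already does.

```python
class ControlBaseSketch:
    """Hypothetical base: cleanup recurses into clones but leaves extra_args alone."""

    def __init__(self):
        self.extra_args = {}
        self.multigpu_clones = {}

    def cleanup(self):
        for clone in self.multigpu_clones.values():
            clone.cleanup()


class FunControlSketch(ControlBaseSketch):
    def pre_run(self, diffusion_model):
        # Stash a reference that must not outlive the sampling run.
        self.extra_args["base_model"] = diffusion_model

    def cleanup(self):
        # Drop the lingering reference, then let the base class visit the clones.
        self.extra_args.pop("base_model", None)
        super().cleanup()


main = FunControlSketch()
main.multigpu_clones["cuda:1"] = FunControlSketch()
unet = object()
main.pre_run(unet)
main.multigpu_clones["cuda:1"].pre_run(unet)
main.cleanup()
```

Using `pop` with a default keeps `cleanup` safe to call even when `pre_run` never ran.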

QwenFunControlNet.pre_run stashes model.diffusion_model into extra_args, which the control_model then uses for forward passes (img_in, txt_in, pe_embedder, time_text_embed). With multigpu, every per-device control clone was being pre_run with the base model on GPU0, so secondary devices would invoke those modules with parameters on GPU0 and inputs on their own device, raising 'Expected all tensors to be on the same device'. Build a device -> per-device BaseModel lookup from the patcher's additional multigpu models and pass each clone the model on its own device. Falls back to the base model when no per-device match is found (single-GPU path and the case where cnet.multigpu_clones lags the patcher's clone set). Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
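
The lookup-with-fallback described above amounts to a dictionary keyed by device. The names below (`ModelOnDevice`, `build_device_lookup`, `pick_model`) are hypothetical and devices are plain strings.

```python
from dataclasses import dataclass


@dataclass
class ModelOnDevice:
    """Hypothetical stand-in for a BaseModel resident on one device."""
    name: str
    device: str


def build_device_lookup(extra_models):
    """Map each device to the model copy that lives on it."""
    return {m.device: m for m in extra_models}


def pick_model(base_model, lookup, device):
    """Prefer the model on the clone's own device; fall back to the base model."""
    return lookup.get(device, base_model)


base = ModelOnDevice("base", "cuda:0")
lookup = build_device_lookup([ModelOnDevice("clone1", "cuda:1")])
```

A control clone on `cuda:1` receives `clone1`, while a clone on a device with no match (or the single-GPU path, where the lookup is empty) receives `base`.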

Per review feedback on #7063. The two functions share the conds-by-hooks accumulation, memory-fit batching, and per-chunk output aggregation; the multigpu variant adds per-device scheduling, .to(device) placement, per-device patcher/control lookup, and thread-pool dispatch around the inner loop. Documenting the relationship without extracting helpers -- extraction can land after the initial worksplit-multigpu release once both paths have settled. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>

The previous gate (len(cond_or_uncond) == 2 and set == {0, 1}) was intended to skip the cond/uncond swap when only one half was present under MultiGPU CFG Split, but it was too restrictive: it also skipped batch_size > 1 + CFG (cond_or_uncond like [0, 0, 1, 1] or [0,0,0,0, 1,1,1,1]), where chunk(2) still splits the batch cleanly into a cond half and an uncond half and the swap is still required. Switch to context.shape[0] >= 2, matching the parallel fix landed on master in #13699. The swap is a permutation-invariant no-op when the two halves don't form a CFG pair (since the output swap_cfg_halves block immediately undoes the permutation), so the only thing the gate actually needs to do is guard against chunk(2) on a batch of one. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
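
The difference between the two gates, and the reason the swap is harmless when it is undone, can be sketched with plain lists. The functions are hypothetical, lists stand in for tensors, and the sketch assumes an even batch size so that both halves have equal length.

```python
def old_gate(cond_or_uncond):
    """Previous check: exactly one cond and one uncond entry."""
    return len(cond_or_uncond) == 2 and set(cond_or_uncond) == {0, 1}


def new_gate(batch_size):
    """New check: only guard against splitting a batch of one."""
    return batch_size >= 2


def swap_halves(batch):
    """Exchange the first and second half of an even-length batch."""
    half = len(batch) // 2
    return batch[half:] + batch[:half]


batch = ["c0", "c1", "u0", "u1"]  # batch_size 2 with CFG
swapped = swap_halves(batch)
restored = swap_halves(swapped)
```

For `cond_or_uncond == [0, 0, 1, 1]` the old gate returns `False` and skips a swap that is still needed, while the new gate lets it through. Applying the swap twice returns the original order, which is why an unnecessary swap followed by the output swap-back is a no-op.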

CrossAttention.kv.view and Attention.qkv_combined.view both hardcoded batch=1 in the reshape, crashing or silently mis-shaping whenever the actual batch dimension was greater than 1. These were fixed on master in #13699 as part of the same patch that gated the chunk(2) swap, but worksplit-multigpu only picked up the chunk(2) gate. Bring the two view() fixes over so we have parity with master. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>

Brings in 18 commits from master so worksplit-multigpu does not regress fixes that landed on main since the last sync:
- #13699 Hunyuan 3D 2.1 batch-size fixes (overlap with our own backport; conflict resolved in favor of the shape>=2 gate that binds swap_cfg_halves once and reuses it for the output swap-back)
- #14031 ModelPatcherDynamic lora reshape / backup restore fix
- #13802 Multi-threaded model load (memory_management / pinned_memory / model_management / aimdo plumbing)
- #12679 lanczos single-channel tensor fix
- #14010 Stable Audio 3 support
- assorted partner-node, openapi, workflow-template, and tooling updates
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>

Two CodeRabbit findings from #7063 (#13 and #14) are deferred because worksplit-multigpu's initial release scope is NVIDIA-only QA. Leave a TODO at the unconditional torch.cuda.set_device call and at the post-aggregation point so the required guards/synchronize are easy to find when multigpu support is extended to XPU/NPU/MPS/CPU/DirectML. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>

The /system_stats endpoint was returning a hardcoded single-element devices list built from get_torch_device(), which only reflects the primary CUDA device. On multi-GPU systems this hides the additional devices from frontends / tooling (the API surface that enables multigpu support discovery). Switch to iterating get_all_torch_devices(), with the primary device kept first so existing clients reading devices[0] keep working. (Worksplit-multigpu-only: get_all_torch_devices is the multigpu helper introduced on this branch; master's /system_stats remains unchanged.) Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
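
The ordering guarantee (primary device first, then the rest) can be sketched as follows. `ordered_devices` is a hypothetical helper and devices are plain strings; the real endpoint builds richer per-device entries.

```python
def ordered_devices(primary, all_devices):
    """Return all devices with the primary one first and no duplicates of it."""
    return [primary] + [d for d in all_devices if d != primary]


devices = ordered_devices("cuda:1", ["cuda:0", "cuda:1", "cuda:2"])
```

Clients that only read `devices[0]` still see the primary device, even when it is not the lowest-numbered one.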

Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler that dispatches per-device tile functions through the existing MultiGPUThreadPool and merges per-device CPU output buffers in deterministic key order. The worker only catches BaseException at the thread boundary to funnel errors to the main thread; bare torch.cuda.set_device and torch.cuda.synchronize calls inside the worker fail loud if the device is not CUDA, which is part of the primitive's contract. Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident until execute time and are returned to CPU afterward. ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when a multigpu descriptor is attached; the single-device path runs unchanged when no clones are present.
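
A simplified sketch of the scheduling idea, not the branch's implementation: the standard library's `ThreadPoolExecutor` stands in for MultiGPUThreadPool, tiles are plain values, the round-robin assignment is an assumption, and `run_tiles` is a hypothetical name. It shows per-device workers, errors caught only at the thread boundary and re-raised on the main thread, and a merge in sorted key order.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tiles(tiles, device_fns):
    """tiles: dict key -> tile; device_fns: dict device -> callable(tile)."""
    devices = sorted(device_fns)

    # Assumed policy: round-robin tiles over devices in sorted key order.
    assignment = {d: [] for d in devices}
    for i, key in enumerate(sorted(tiles)):
        assignment[devices[i % len(devices)]].append(key)

    def worker(device):
        try:
            fn = device_fns[device]
            return {key: fn(tiles[key]) for key in assignment[device]}, None
        except BaseException as exc:  # caught only at the thread boundary
            return {}, exc

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        results = list(pool.map(worker, devices))

    # Funnel the first worker error back to the calling thread.
    for _, exc in results:
        if exc is not None:
            raise exc

    # Merge per-device outputs in deterministic key order.
    merged = {}
    for partial, _ in results:
        merged.update(partial)
    return [merged[key] for key in sorted(merged)]


outputs = run_tiles(
    {0: 1, 1: 2, 2: 3},
    {"dev_a": lambda t: t * 10, "dev_b": lambda t: t * 10},
)
```

Because the merge iterates over sorted keys, the output order does not depend on which worker finishes first.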

fixup threaded loader with worksplit multi-gpu

* Revert "Add tiled VAE lane to MultiGPU Work Units"
  This reverts commit 4d3d68e. The tiled VAE lane will land as part of a follow-up PR alongside the UPSCALE_MODEL lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused.
* Revert "Add UPSCALE_MODEL lane to MultiGPU CFG Split"
  This reverts commit 74b0a82. The UPSCALE_MODEL lane will land as part of a follow-up PR alongside the tiled VAE lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused.
---------
Co-authored-by: John Pollock <pollockjj@gmail.com>

Kosinkadink added 30 commits December 29, 2024 15:49

Add 'sigmas' to transformer_options so that downstream code can know …

72bbf49

…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'

Merge branch 'master' into hooks_part2

bf21be0

Merge branch 'master' into hooks_part2

d44295e

Cleaned up hooks.py, refactored Hook.should_register and add_hook_pat…

5a2ad03

…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed

Refactor WrapperHook into TransformerOptionsHook, as there is no need…

776aa73

… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)

Refactored HookGroup to also store a dictionary of hooks separated by…

111fd0c

… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type

In inner_sample, change "sigmas" to "sampler_sigmas" in transformer_o…

6620d86

…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch

Merge branch 'add_sample_sigmas' into hooks_part2

db2d7ad

Made hook clone code sane, made clear ObjectPatchHook and SetInjectio…

4446c86

…nsHook are not yet operational

Filter only registered hooks on self.conds in CFGGuider.sample

0a7e2ae

Merge branch 'master' into hooks_part2

6463c39

Make hook_scope functional for TransformerOptionsHook

f48f90e

Merge branch 'master' into hooks_part2

2724ac4

removed 4 whitespace lines to satisfy Ruff,

1b38f5b

Add a get_injections function to ModelPatcher

58bf881

Made TransformerOptionsHook contribute to registered hooks properly, …

216fea1

…added some doc strings and removed a so-far unused variable

Merge branch 'master' into hooks_part2

11c6d56

Rename AddModelsHooks to AdditionalModelsHook, rename SetInjectionsHo…

3cd4c5c

…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)

Clean up a typehint

7333281

Merge branch 'comfyanonymous:master' into multigpu_support

66838eb

Add get_all_torch_devices to get detected devices intended for curren…

871258a

…t torch hardware device

Initial proof of concept of giving splitting cond sampling between mu…

7448f02

…ltiple GPUs

Merge branch 'comfyanonymous:master' into multigpu_support

d3cf2b7

Fix cond_cat to not try to cast anything that doesn't have a 'to' fun…

e88c6c0

…ction

Merge branch 'master' into multigpu_support

8d4b501

Make test node for multigpu instead of storing it in just a local __i…

d508807

…nit__.py

Merge branch 'master' into multigpu_support

ec16ee2

Add nodes_multigpu.py to loaded nodes

198953c

Kosinkadink and others added 23 commits May 20, 2026 16:46

Fix CodeRabbit findings in worksplit-multigpu (#14017)

1417b71

Fix CodeRabbit findings in worksplit-multigpu

memory_management: replace thread refusal with mutex

df17b56

This was an attempt at a fast path: ensure the file slice was created by the owning thread and refuse otherwise, without needing to mutex, but worksplit-multigpu doesn't work that way. Go mutex. Shoot me for overthinking next time.
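
For illustration only, the mutex approach looks roughly like this. `FileSliceSketch` is a hypothetical class, not the code in memory_management; it only shows a shared resource serialized with `threading.Lock` so that any thread may use it, rather than rejecting callers that are not the creating thread.

```python
import threading


class FileSliceSketch:
    """Hypothetical shared slice guarded by a lock instead of an owner-thread check."""

    def __init__(self, data: bytes):
        self._data = data
        self._lock = threading.Lock()
        self.reads = 0

    def read(self, start: int, end: int) -> bytes:
        # Any thread may call this; the lock serializes access to shared state.
        with self._lock:
            self.reads += 1
            return self._data[start:end]


shared = FileSliceSketch(b"abcdef")
threads = [
    threading.Thread(target=lambda: [shared.read(0, 3) for _ in range(100)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock costs a little on every call, but it stays correct when several per-device worker threads touch the same slice.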

comfy-aimdo 0.4.4

7a18f9a

Comfy-aimdo 0.4.4 contains a small bugfix to allow recovery of a hostbuf after full truncation. This pattern doesn't happen as a general rule, but does happen in the upcoming worksplit-multigpu branch.

Add tiled VAE lane to MultiGPU Work Units

4d3d68e

Merge pull request #14052 from rattus128/prs/worksplit-t-load-fix

cb83c41

fixup threaded loader with worksplit multi-gpu

Add tiled VAE lane to MultiGPU Work Units

7e2bcb4

Kosinkadink requested review from alexisrolland, comfyanonymous, guill, kijai and rattus128 as code owners May 22, 2026 23:45

Kosinkadink marked this pull request as draft May 22, 2026 23:46

Base automatically changed from worksplit-multigpu to master May 26, 2026 01:26

Reintroduce @pollockjj's tiled-VAE and UPSCALE_MODEL MultiGPU lanes #14067
Kosinkadink wants to merge 159 commits into master from reintroduce/pollockjj-vae-upscale-lanes

4 participants
