Reintroduce @pollockjj's tiled-VAE and UPSCALE_MODEL MultiGPU lanes#14067
Draft
Kosinkadink wants to merge 159 commits into
Draft
Reintroduce @pollockjj's tiled-VAE and UPSCALE_MODEL MultiGPU lanes#14067Kosinkadink wants to merge 159 commits into
Kosinkadink wants to merge 159 commits into
Conversation
…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…nsHook are not yet operational
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
…t torch hardware device
load_checkpoint_guess_config_clip_only() calls load_checkpoint_guess_config() with output_model=False, leaving out[0] as None. The subsequent unconditional assignment of cached_patcher_init crashed with AttributeError, breaking CLIP-only checkpoint loading entirely. Guard the assignment with a None check. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
torch.device(i) defaults to CUDA, so XPU/NPU branches were producing 'cuda:N' devices that don't match get_torch_device() output ('xpu:N'/'npu:N'). This caused devices.remove(get_torch_device()) to raise ValueError when exclude_current=True on non-NVIDIA hardware. Use explicit device strings, and guard the remove() with a membership check for safety.
Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5
Co-authored-by: Amp <amp@ampcode.com>
create_multigpu_deepclones cloned the existing 'multigpu' additional_models list verbatim and never pruned entries beyond limit_extra_devices. If a workflow was previously prepared for more GPUs, reducing max_gpus would leave stale clones attached and eligible for later scheduling. Replace the TODO block with a real prune that keeps only clones whose load_device is either the model's load_device or in limit_extra_devices, and re-match clones if anything was removed. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
…path) The multigpu cond-batching loop called model.memory_required(input_shape) without conditioning shapes, while the single-GPU path at line 279 passes cond_shapes. Large conditioning tensors (e.g. video prompts, control inputs) were therefore under-counted, risking OOM at runtime when the chosen batch size was too large. Match the single-GPU pattern by building cond_shapes from each batched cond's conditioning dict and passing it to memory_required. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
…ed gap Two doc-only changes addressing minor CodeRabbit findings on PR #7063: * cli_args.py: clarify --cuda-device help text to document the required comma-separated format ('0' or '0,1'), matching how the value is consumed by CUDA_VISIBLE_DEVICES in main.py. * nodes_multigpu.py: add a docstring NOTE on the (currently unregistered) MultiGPUOptionsNode explaining that its relative_speed input is plumbed through to model_options['multigpu_options'] but is not yet consulted by the cond scheduler, which still uses uniform round-robin via next_available_device(). Wire relative_speed into the scheduler before re-enabling the node. Amp-Thread-ID: https://ampcode.com/threads/T-019e43b8-8258-70fd-ab3a-53e4c97f85d5 Co-authored-by: Amp <amp@ampcode.com>
Drop the new ignore_multigpu positional argument from prepare_state and from the ON_PREPARE_STATE callbacks; pass the flag via model_options instead. This restores the original 3-arg callback signature so existing custom-node ON_PREPARE_STATE handlers keep working unchanged, while still letting prepare_state's recursive call into multigpu_clones short-circuit. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
QwenFunControlNet.pre_run stashes the model's diffusion_model into self.extra_args['base_model'], but ControlBase.cleanup never clears extra_args. The diffusion_model reference therefore lingered between sampling runs, blocking ComfyUI's model offload/eviction logic from freeing the UNet and -- for multigpu -- holding one such reference per per-device control clone (defeating the max_gpus pruning added in this PR). Override cleanup to drop the entry; super().cleanup() already recurses into multigpu_clones so each per-device clone pops its own. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
QwenFunControlNet.pre_run stashes model.diffusion_model into extra_args, which the control_model then uses for forward passes (img_in, txt_in, pe_embedder, time_text_embed). With multigpu, every per-device control clone was being pre_run with the base model on GPU0, so secondary devices would invoke those modules with parameters on GPU0 and inputs on their own device, raising 'Expected all tensors to be on the same device'. Build a device -> per-device BaseModel lookup from the patcher's additional multigpu models and pass each clone the model on its own device. Falls back to the base model when no per-device match is found (single-GPU path and the case where cnet.multigpu_clones lags the patcher's clone set). Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
Fix CodeRabbit findings in worksplit-multigpu
Per review feedback on #7063. The two functions share the conds-by-hooks accumulation, memory-fit batching, and per-chunk output aggregation; the multigpu variant adds per-device scheduling, .to(device) placement, per-device patcher/control lookup, and thread-pool dispatch around the inner loop. Documenting the relationship without extracting helpers -- extraction can land after the initial worksplit-multigpu release once both paths have settled. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
The previous gate (len(cond_or_uncond) == 2 and set == {0, 1}) was
intended to skip the cond/uncond swap when only one half was present
under MultiGPU CFG Split, but it was too restrictive: it also skipped
batch_size > 1 + CFG (cond_or_uncond like [0, 0, 1, 1] or [0,0,0,0,
1,1,1,1]), where chunk(2) still splits the batch cleanly into a cond
half and an uncond half and the swap is still required.
Switch to context.shape[0] >= 2, matching the parallel fix landed on
master in #13699. The swap is a permutation-invariant no-op when the
two halves don't form a CFG pair (since the output swap_cfg_halves
block immediately undoes the permutation), so the only thing the gate
actually needs to do is guard against chunk(2) on a batch of one.
Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
CrossAttention.kv.view and Attention.qkv_combined.view both hardcoded batch=1 in the reshape, crashing or silently mis-shaping whenever the actual batch dimension was greater than 1. These were fixed on master in #13699 as part of the same patch that gated the chunk(2) swap, but worksplit-multigpu only picked up the chunk(2) gate. Bring the two view() fixes over so we have parity with master. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
Brings in 18 commits from master so worksplit-multigpu does not regress fixes that landed on main since the last sync: - #13699 Hunyuan 3D 2.1 batch-size fixes (overlap with our own backport; conflict resolved in favor of the shape>=2 gate that binds swap_cfg_halves once and reuses it for the output swap-back) - #14031 ModelPatcherDynamic lora reshape / backup restore fix - #13802 Multi-threaded model load (memory_management / pinned_memory / model_management / aimdo plumbing) - #12679 lanczos single-channel tensor fix - #14010 Stable Audio 3 support - assorted partner-node, openapi, workflow-template, and tooling updates Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
Two CodeRabbit findings from #7063 (#13 and #14) are deferred because worksplit-multigpu's initial release scope is NVIDIA-only QA. Leave a TODO at the unconditional torch.cuda.set_device call and at the post-aggregation point so the required guards/synchronize are easy to find when multigpu support is extended to XPU/NPU/MPS/CPU/DirectML. Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
The /system_stats endpoint was returning a hardcoded single-element devices list built from get_torch_device(), which only reflects the primary CUDA device. On multi-GPU systems this hides the additional devices from frontends / tooling (the API surface that enables multigpu support discovery). Switch to iterating get_all_torch_devices(), with the primary device kept first so existing clients reading devices[0] keep working. (Worksplit-multigpu-only: get_all_torch_devices is the multigpu helper introduced on this branch; master's /system_stats remains unchanged.) Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082 Co-authored-by: Amp <amp@ampcode.com>
This was an attempt to be a fast path by ensuring the file slice was created by the owning thread and refusing without needing ot mutex but worksplit-multigpu doesnt work that way. Go mutex. Shoot me for overthinking next time.
Comfy-aimdo 0.4.4 contains a small bugfix to allow recovery of a hostbuf after full truncation. This pattern doesnt happen as a general rule, but does happen in the upcoming worksplit-multigpu branch.
Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler that dispatches per-device tile functions through the existing MultiGPUThreadPool and merges per-device CPU output buffers in deterministic key order. The worker only catches BaseException at the thread boundary to funnel errors to the main thread; bare torch.cuda.set_device and torch.cuda.synchronize calls inside the worker fail loud if the device is not CUDA, which is part of the primitive's contract. Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident until execute time and are returned to CPU afterward. ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when a multigpu descriptor is attached; the single-device path runs unchanged when no clones are present.
fixup threaded loader with worksplit multi-gpu
* Revert "Add tiled VAE lane to MultiGPU Work Units" This reverts commit 4d3d68e. The tiled VAE lane will land as part of a follow-up PR alongside the UPSCALE_MODEL lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused. * Revert "Add UPSCALE_MODEL lane to MultiGPU CFG Split" This reverts commit 74b0a82. The UPSCALE_MODEL lane will land as part of a follow-up PR alongside the tiled VAE lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused. --------- Co-authored-by: John Pollock <pollockjj@gmail.com>
Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler that dispatches per-device tile functions through the existing MultiGPUThreadPool and merges per-device CPU output buffers in deterministic key order. The worker only catches BaseException at the thread boundary to funnel errors to the main thread; bare torch.cuda.set_device and torch.cuda.synchronize calls inside the worker fail loud if the device is not CUDA, which is part of the primitive's contract. Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident until execute time and are returned to CPU afterward. ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when a multigpu descriptor is attached; the single-device path runs unchanged when no clones are present.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reapplies the two
@pollockjjcommits that were temporarily reverted by #14066 so they could be separated from #14052 (threaded-loader fix).Cherry-picked in original order:
74b0a826—Add UPSCALE_MODEL lane to MultiGPU CFG Split4d3d68e4—Add tiled VAE lane to MultiGPU Work UnitsAuthorship preserved (
John Pollock <pollockjj@gmail.com>).Companion to #14066.