Fix nested tensor noise mismatch in CFGGuider.sample #13318
djdarcy wants to merge 1 commit into Comfy-Org:master
When using LTXAV (audio+video) workflows, latent_image is a NestedTensor but noise may be a regular tensor. Calling unbind() on non-nested noise splits along dim=0 (channels), producing a shape mismatch at noise_scaling. Check whether noise is nested before unbinding. If not, pad with zero-noise for additional components (e.g. audio), which is semantically correct since those components don't need denoising in the video sampler.
🧹 Nitpick comments (1)
comfy/samplers.py (1)
1008-1018: Fix looks correct for the non-nested noise case.

The conditional check for `noise.is_nested` properly handles the mismatch scenario described in the PR. Using `torch.zeros_like()` for padding is appropriate since no denoising is applied to the padded audio components, and it mirrors the defensive pattern used for `denoise_mask` handling below.

One minor observation: when `noise.is_nested` is True, the code unbinds without checking if `n_tensors` has the same number of components as `li_tensors`. The `denoise_mask` handling (lines 1024-1030) defensively truncates and pads to match `latent_shapes`. If nested noise with mismatched components is a possible scenario, similar handling could be added here.

💡 Optional: Add defensive handling for nested noise component mismatch

```diff
 if latent_image.is_nested:
     li_tensors = latent_image.unbind()
     if noise.is_nested:
         n_tensors = noise.unbind()
+        n_tensors = list(n_tensors[:len(li_tensors)])  # Truncate if more
+        for i in range(len(n_tensors), len(li_tensors)):
+            n_tensors.append(torch.zeros_like(li_tensors[i]))  # Pad if fewer
     else:
         # Noise only covers video -- pad remaining components (audio) with zeros
         n_tensors = [noise]
         for i in range(1, len(li_tensors)):
             n_tensors.append(torch.zeros_like(li_tensors[i]))
     latent_image, latent_shapes = comfy.utils.pack_latents(li_tensors)
     noise, _ = comfy.utils.pack_latents(n_tensors)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/samplers.py` around lines 1008 - 1018, When noise.is_nested is True, add defensive handling to ensure n_tensors has the same number of components as li_tensors before packing: after n_tensors = noise.unbind(), compare len(n_tensors) to len(li_tensors) (or latent_shapes) and if they differ truncate extra components or append torch.zeros_like(li_tensors[i]) for missing components (mirroring the denoise_mask truncation/padding behavior around denoise_mask handling). Then call comfy.utils.pack_latents(n_tensors) as before so latent_image/latent_shapes and noise align.
Summary
Fixes `RuntimeError: The size of tensor a (N) must match the size of tensor b (M) at non-singleton dimension 2` when using LTXAV audio+video workflows with `SamplerCustomAdvanced`.

Problem
In `CFGGuider.sample()` (comfy/samplers.py:1008-1010), when `latent_image` is a `NestedTensor` (e.g. from `LTXVConcatAVLatent` combining video + audio latents), the code unconditionally calls `noise.unbind()`.

When noise is a regular (non-nested) tensor, `unbind(dim=0)` splits along the channel dimension, producing 128 small tensors instead of the expected 2 nested components (video + audio). After `pack_latents` flattens these, the shapes are completely mismatched (e.g. `[128, 1, 3751]` vs `[1, 1, 512384]`), causing the RuntimeError at model_sampling.py:72 in `noise_scaling()`.

How to reproduce
- Use `LTXVConcatAVLatent` to combine video and audio latents
- Sample with `SamplerCustomAdvanced`

This affects any LTXAV workflow where the noise generator produces non-nested noise for a nested latent. No custom nodes are required to trigger this.
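To make the shape arithmetic concrete, here is a shape-only sketch in plain Python (no PyTorch needed). The two latent shapes and the packing rule (reshape each component to `(shape[0], 1, -1)`, then concatenate on the last axis) are assumptions chosen so the numbers land on the sizes quoted in the error; they are not taken from the ComfyUI source, and the real tensors may be laid out differently. The helpers `unbind_shapes` and `packed_shape` are hypothetical names used only for this illustration.

```python
from math import prod


def unbind_shapes(shape, dim=0):
    """Shapes of the pieces a regular tensor yields on unbind(dim):
    one piece per index along `dim`, with that axis removed."""
    rest = shape[:dim] + shape[dim + 1:]
    return [rest for _ in range(shape[dim])]


def packed_shape(component_shapes):
    """ASSUMED packing rule: each component becomes (shape[0], 1, -1),
    then all components are concatenated along the last axis."""
    leading = {s[0] for s in component_shapes}
    if len(leading) != 1:
        raise ValueError("components disagree on their leading dimension")
    flat = sum(prod(s[1:]) for s in component_shapes)
    return (component_shapes[0][0], 1, flat)


# Hypothetical component shapes, picked to match the sizes in the error message.
video = (1, 128, 31, 11, 11)  # 128 * 31 * 11 * 11 = 480128 elements
audio = (1, 8, 252, 16)       # 8 * 252 * 16 = 32256 elements

# A nested latent unbinds into its two components (video, audio).
latent_packed = packed_shape([video, audio])

# Regular video-only noise is unbound along its leading axis instead.
noise_packed = packed_shape(unbind_shapes(video, dim=0))

print(latent_packed)  # (1, 1, 512384)
print(noise_packed)   # (128, 1, 3751)
```

Under these assumptions the packed latent and the packed noise disagree in both the first and the last dimension, which is the kind of mismatch that `noise_scaling()` then reports.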
Fix
Check `noise.is_nested` before unbinding. If noise is not nested, treat it as the first component (video) and pad the remaining components (audio) with `torch.zeros_like()`. I believe zero noise for the audio component is semantically correct: no denoising is applied to the audio padding in the video sampler.

Note: the `denoise_mask` handling a few lines below (lines 1014-1018) already follows this pattern correctly. It checks `denoise_mask.is_nested` and pads with `torch.ones()` for missing components. This PR applies the same defensive pattern to noise handling.
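The padding idea can be sketched independently of PyTorch. The snippet below is a minimal illustration, not the code in this PR: plain lists of floats stand in for tensors, and `align_noise` and `zeros_like` are hypothetical helpers. It pads missing components with zeros as described above, and also drops surplus components, which mirrors the optional hardening suggested in review rather than the PR itself.

```python
def zeros_like(values):
    """Stand-in for torch.zeros_like(): zeros with the same length as `values`."""
    return [0.0] * len(values)


def align_noise(noise_parts, latent_parts, make_zeros=zeros_like):
    """Return exactly one noise entry per latent component.

    Surplus noise entries are dropped; missing ones are filled with zeros
    shaped like the matching latent component.
    """
    aligned = list(noise_parts[:len(latent_parts)])
    for reference in latent_parts[len(aligned):]:
        aligned.append(make_zeros(reference))
    return aligned


# Stand-ins for the unbound latent components (video first, then audio).
video_latent = [0.5, -1.0, 0.25, 2.0]
audio_latent = [0.1, 0.2]

# Non-nested noise only covers the first (video) component.
video_noise = [0.3, -0.7, 1.1, 0.0]

noise_parts = align_noise([video_noise], [video_latent, audio_latent])
print(noise_parts)  # [[0.3, -0.7, 1.1, 0.0], [0.0, 0.0]]
```

After alignment, the noise list has the same number of components, with the same sizes, as the latent list, so packing both should produce matching shapes.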