
Fix stable_video_diffusion #13684

Open
hlky wants to merge 1 commit into huggingface:main from hlky:fix-13627

Conversation

Contributor

@hlky hlky commented May 6, 2026

Fix stable_video_diffusion

Fixes #13627

Fixes

Issue 1

Return (frames,) when return_dict=False so Stable Video Diffusion follows the standard single-output tuple contract.
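A minimal sketch of the single-output tuple contract described above; the names here are illustrative, not the exact pipeline code:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class StableVideoDiffusionPipelineOutput:
    frames: List  # decoded video frames


def finish_call(frames, return_dict: bool) -> Union[Tuple, StableVideoDiffusionPipelineOutput]:
    if not return_dict:
        # Standard diffusers contract: a tuple is returned even when
        # there is only a single output.
        return (frames,)
    return StableVideoDiffusionPipelineOutput(frames=frames)


out = finish_call(["frame0"], return_dict=False)
assert isinstance(out, tuple) and out[0] == ["frame0"]
```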

Issue 2

Use the maximum of min_guidance_scale and max_guidance_scale when preparing CFG state, so decreasing guidance schedules still duplicate conditioning batches correctly.
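To illustrate the point (function name and threshold are hypothetical): whether CFG batch duplication is needed should depend on the largest value anywhere in the guidance schedule, which for a decreasing schedule is `min_guidance_scale`, not `max_guidance_scale`:

```python
def needs_cfg(min_guidance_scale: float, max_guidance_scale: float) -> bool:
    # A decreasing schedule (min > max) still requires duplicated
    # conditioning batches if its largest value exceeds 1.
    return max(min_guidance_scale, max_guidance_scale) > 1.0


assert needs_cfg(1.0, 3.0)      # increasing schedule
assert needs_cfg(3.0, 1.0)      # decreasing schedule: missed by max_guidance_scale alone
assert not needs_cfg(1.0, 1.0)  # no CFG needed
```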

Issue 3

Preprocess PIL, NumPy, tensor, and list image inputs through shared processor utilities before CLIP resizing, so tensor inputs are resized consistently with PIL inputs.
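A rough sketch of normalizing all supported input types to one tensor layout before resizing; this is illustrative and does not mirror the actual diffusers helper signatures:

```python
import numpy as np
import torch
from PIL import Image


def to_tensor_batch(image):
    """Normalize PIL / NumPy / tensor / list inputs to a float NCHW tensor."""
    if isinstance(image, Image.Image):
        image = [image]
    if isinstance(image, list) and isinstance(image[0], Image.Image):
        image = np.stack([np.array(i).astype(np.float32) / 255.0 for i in image])
    if isinstance(image, np.ndarray):
        image = torch.from_numpy(image).permute(0, 3, 1, 2)
    if isinstance(image, list) and isinstance(image[0], torch.Tensor):
        image = torch.stack(image)
    # Every input type now reaches the resizing step in the same layout.
    return image


batch = to_tensor_batch(Image.new("RGB", (8, 8)))
assert batch.shape == (1, 3, 8, 8)
```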

Issue 4

Cast custom latents to the denoising dtype and device instead of only moving them to device.
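Sketch of the fix (function name illustrative): user-supplied latents must match both the denoising device and dtype, since `.to(device)` alone leaves a float32 latent in a float16 pipeline:

```python
import torch


def prepare_user_latents(latents: torch.Tensor, device, dtype) -> torch.Tensor:
    # A single .to() call handles device and dtype together.
    return latents.to(device=device, dtype=dtype)


latents = torch.randn(1, 4, 8, 8)  # float32 by default
out = prepare_user_latents(latents, device="cpu", dtype=torch.float16)
assert out.dtype == torch.float16
```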

Issue 5

Validate tuple/list config lengths in UNetSpatioTemporalConditionModel before indexed access.
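A hypothetical validation sketch: checking that tuple/list config entries have the expected length before indexing turns a mid-forward `IndexError` into a clear error at construction time:

```python
def validate_lengths(name: str, value, expected: int):
    if isinstance(value, (tuple, list)) and len(value) != expected:
        raise ValueError(
            f"`{name}` must have {expected} entries to match "
            f"`block_out_channels`, got {len(value)}."
        )


num_blocks = 4
validate_lengths("num_attention_heads", (8, 16, 32, 64), num_blocks)  # ok
try:
    validate_lengths("num_attention_heads", (8, 16), num_blocks)
except ValueError as e:
    assert "must have 4 entries" in str(e)
```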

Additional fixes

Batch and dtype consistency

Repeat image embeddings, VAE image latents, added time IDs, and guidance scale in effective batch order for num_videos_per_prompt.
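"Effective batch order" can be sketched as follows (shapes are illustrative): per-image tensors are repeated with `repeat_interleave` so copies for the same input image stay contiguous and aligned across all conditioning tensors:

```python
import torch

batch_size, num_videos_per_prompt = 2, 3
image_embeddings = torch.arange(batch_size).float().view(batch_size, 1)

# Each image's embedding is repeated contiguously for its videos:
# [img0, img0, img0, img1, img1, img1]
expanded = image_embeddings.repeat_interleave(num_videos_per_prompt, dim=0)
assert expanded.shape[0] == batch_size * num_videos_per_prompt
assert expanded[:3].eq(0).all() and expanded[3:].eq(1).all()
```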

Use the UNet dtype for denoising-path tensors while preserving VAE dtype/upcast handling.
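A minimal sketch of that dtype split, with assumed example dtypes: tensors on the denoising path follow the UNet's dtype, while the latents are cast to the VAE's (possibly upcast) dtype only at decode time:

```python
import torch

unet_dtype = torch.float16  # dtype used throughout the denoising loop
vae_dtype = torch.float32   # VAE may run upcast for numerical stability

latents = torch.randn(1, 4, 8, 8, dtype=unet_dtype)
# Cast to the VAE dtype only when handing latents to the decoder.
decoder_input = latents.to(vae_dtype)
assert decoder_input.dtype == vae_dtype and latents.dtype == unet_dtype
```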

Docs and typing

Update SVD docstrings and type hints for supported image inputs, output_type values, generator-list support, helper returns, and tuple/dataclass output behavior.

Meta issue patterns

Fixed: Pattern 1 batch/conditioning expansion, Pattern 5 dtype/device/config assumptions, Pattern 6 output contract, Pattern 7 validation/runtime alignment, Pattern 10 fast coverage.

Not applicable: Pattern 2 ignored public arguments, Pattern 3 mask handling, Pattern 4 optional dependency/default handling, Pattern 8 copied-code drift, Pattern 9 shared attention/offload infrastructure.

Unskipped tests

test_inference_batch_single_identical

Already passing after removing skip.

test_inference_batch_consistent

It was failing in part due to Issue 3:

```
src\diffusers\pipelines\stable_video_diffusion\pipeline_stable_video_diffusion.py:503: in __call__
    image_embeddings = self._encode_image(image, device, num_videos_per_prompt, self.do_classifier_free_guidance)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src\diffusers\pipelines\stable_video_diffusion\pipeline_stable_video_diffusion.py:201: in _encode_image
    image = self.video_processor.pil_to_numpy(image)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src\diffusers\image_processor.py:166: in pil_to_numpy
    images = [np.array(image).astype(np.float32) / 255.0 for image in images]
```

The test builds `batched_input[name] = batch_size * [value]`, so the old `if not isinstance(image, torch.Tensor)` check misclassifies the resulting list, and a `torch.Tensor` ends up in `pil_to_numpy`. Solution: route all inputs through the shared processor classes.

test_float16_inference

Fixed by running `prepare_latents` in `torch.float32` and then casting to the target dtype. This is because `randn_tensor` with `torch.float32` produces a completely different tensor than `randn_tensor` with `torch.float16`, even from the same seed. Recommendation: generate random tensors in float32 and cast afterwards whenever reproducibility is a concern.
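The reproducibility point can be demonstrated with plain `torch.randn` (standing in for `randn_tensor`): a seeded float16 draw is not the seeded float32 draw rounded down, so generating in float32 and casting is what keeps outputs aligned across precisions:

```python
import torch

shape = (1, 4, 8, 8)

g = torch.Generator().manual_seed(0)
noise_fp32 = torch.randn(shape, generator=g, dtype=torch.float32)

g = torch.Generator().manual_seed(0)
noise_fp16 = torch.randn(shape, generator=g, dtype=torch.float16)

# The half-precision draw is a different sample sequence, not a rounding
# of the float32 draw.
assert not torch.equal(noise_fp32.half(), noise_fp16)

# Generating in float32 then casting is deterministic across runs.
g = torch.Generator().manual_seed(0)
assert torch.equal(
    torch.randn(shape, generator=g, dtype=torch.float32).half(),
    noise_fp32.half(),
)
```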

Notes

Slow test expected slice may have changed from prepare_latents change.


Development

Successfully merging this pull request may close these issues.

stable_video_diffusion model/pipeline review
