# Fix `stable_video_diffusion` (#13684)
hlky wants to merge 1 commit into `huggingface:main`.
Fixes #13627.
## Fixes

### Issue 1
Return `(frames,)` when `return_dict=False` so Stable Video Diffusion follows the standard single-output tuple contract.
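A minimal sketch of that contract; the helper name `_finalize` is illustrative, the output class is the one the SVD pipeline already uses:

```python
from diffusers.pipelines.stable_video_diffusion import StableVideoDiffusionPipelineOutput

def _finalize(frames, return_dict: bool):
    # Standard diffusers output contract: a one-element tuple when
    # return_dict=False, otherwise the dataclass output.
    if not return_dict:
        return (frames,)
    return StableVideoDiffusionPipelineOutput(frames=frames)
```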
### Issue 2

Use the maximum of `min_guidance_scale` and `max_guidance_scale` when preparing CFG state, so decreasing guidance schedules still duplicate conditioning batches correctly.
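A runnable sketch of the idea; the embedding shapes and `negative_image_embeddings` name are illustrative stand-ins for what the pipeline prepares:

```python
import torch

min_guidance_scale, max_guidance_scale = 3.0, 1.0  # a decreasing schedule

# Deciding CFG from max_guidance_scale alone would say "no CFG" here even
# though guidance starts at 3.0; taking the max of both endpoints does not.
do_classifier_free_guidance = max(min_guidance_scale, max_guidance_scale) > 1.0

image_embeddings = torch.randn(1, 1, 1024)  # conditional embeddings
negative_image_embeddings = torch.zeros_like(image_embeddings)
if do_classifier_free_guidance:
    # Duplicate the conditioning batch for the unconditional/conditional passes.
    image_embeddings = torch.cat([negative_image_embeddings, image_embeddings])
```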
### Issue 3

Preprocess PIL, NumPy, tensor, and list image inputs through shared processor utilities before CLIP resizing, so tensor inputs are resized consistently with PIL inputs.
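A sketch assuming diffusers' `VaeImageProcessor` as the shared utility (the PR may use the video-processor equivalent); `preprocess` normalizes all four input types to one batched tensor:

```python
import torch
from diffusers.image_processor import VaeImageProcessor

processor = VaeImageProcessor(vae_scale_factor=8)

# preprocess() accepts PIL.Image, np.ndarray, torch.Tensor, or lists of these
# and returns a batched float tensor, so CLIP-side resizing sees the same
# representation regardless of what the caller passed in.
image = torch.rand(3, 320, 512)  # a CHW tensor input, for illustration
image = processor.preprocess(image, height=576, width=1024)
print(image.shape)  # torch.Size([1, 3, 576, 1024])
```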
### Issue 4

Cast custom `latents` to the denoising dtype and device instead of only moving them to the device.
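A sketch of the relevant branch, using diffusers' `randn_tensor` helper; the shape and dtypes are illustrative:

```python
import torch
from diffusers.utils.torch_utils import randn_tensor

shape = (1, 14, 4, 8, 8)  # (batch, frames, channels, height, width)
device, dtype = torch.device("cpu"), torch.float16
generator = torch.Generator(device).manual_seed(0)
latents = torch.randn(shape)  # user-supplied float32 latents

# User latents must match the denoising dtype as well as the device;
# .to(device) alone would leave float32 latents in a float16 pipeline.
if latents is None:
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
    latents = latents.to(device=device, dtype=dtype)
```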
### Issue 5

Validate tuple/list config lengths in `UNetSpatioTemporalConditionModel` before indexed access.
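A sketch of the kind of check added; the config values below are the model's defaults, and the exact messages are illustrative:

```python
down_block_types = (
    "CrossAttnDownBlockSpatioTemporal",
    "CrossAttnDownBlockSpatioTemporal",
    "CrossAttnDownBlockSpatioTemporal",
    "DownBlockSpatioTemporal",
)
block_out_channels = (320, 640, 1280, 1280)
num_attention_heads = (5, 10, 20, 20)

# Tuple/list configs are indexed per down block, so mismatched lengths would
# otherwise surface later as an opaque IndexError.
if len(block_out_channels) != len(down_block_types):
    raise ValueError(
        "`block_out_channels` must have the same length as `down_block_types`: "
        f"{len(block_out_channels)} != {len(down_block_types)}."
    )
if isinstance(num_attention_heads, (list, tuple)) and len(num_attention_heads) != len(down_block_types):
    raise ValueError(
        "`num_attention_heads` must have the same length as `down_block_types`: "
        f"{len(num_attention_heads)} != {len(down_block_types)}."
    )
```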
## Additional fixes

### Batch and dtype consistency
Repeat image embeddings, VAE image latents, added time IDs, and guidance scale in effective batch order for `num_videos_per_prompt`. Use the UNet dtype for denoising-path tensors while preserving VAE dtype/upcast handling.
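A sketch of one consistent expansion order, with illustrative shapes; `repeat_interleave` on the batch dimension keeps all videos for a given prompt adjacent, and every per-sample tensor follows the same order:

```python
import torch

num_videos_per_prompt, batch_size, num_frames = 2, 1, 14
image_embeddings = torch.randn(batch_size, 1, 1024)
image_latents = torch.randn(batch_size, num_frames, 4, 8, 8)
added_time_ids = torch.randn(batch_size, 3)
guidance_scale = torch.linspace(1.0, 3.0, num_frames).unsqueeze(0)  # (1, frames)

# Expand every per-sample tensor in the same effective batch order so
# frame-wise conditioning stays aligned with its video after expansion.
image_embeddings = image_embeddings.repeat_interleave(num_videos_per_prompt, dim=0)
image_latents = image_latents.repeat_interleave(num_videos_per_prompt, dim=0)
added_time_ids = added_time_ids.repeat_interleave(num_videos_per_prompt, dim=0)
guidance_scale = guidance_scale.repeat_interleave(num_videos_per_prompt, dim=0)
```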
### Docs and typing
Update SVD docstrings and type hints for supported image inputs, `output_type` values, generator-list support, helper returns, and tuple/dataclass output behavior.

## Meta issue patterns
- Fixed: Pattern 1 (batch/conditioning expansion), Pattern 5 (dtype/device/config assumptions), Pattern 6 (output contract), Pattern 7 (validation/runtime alignment), Pattern 10 (fast coverage).
- Not applicable: Pattern 2 (ignored public arguments), Pattern 3 (mask handling), Pattern 4 (optional dependency/default handling), Pattern 8 (copied-code drift), Pattern 9 (shared attention/offload infrastructure).
## Unskipped tests

### test_inference_batch_single_identical

Already passing after removing the skip.
### test_inference_batch_consistent

It was failing in part due to Issue 3. The test builds batched inputs with `batched_input[name] = batch_size * [value]`, so `image` arrives as a list of tensors; the old check `if not isinstance(image, torch.Tensor)` then takes the PIL branch, and a `torch.Tensor` ends up in `pil_to_numpy`. Solution: just use the processor classes.
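A runnable sketch of the failure mode:

```python
import torch

image = torch.rand(3, 256, 256)
batch_size = 2
batched_image = batch_size * [image]  # how the test batches inputs

# The old guard: a list fails the isinstance check even when it holds tensors,
# so the tensors were routed down the PIL path (pil_to_numpy) and crashed.
print(isinstance(batched_image, torch.Tensor))  # False -> treated as PIL input
```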
### test_float16_inference

Fixed by running `prepare_latents` in `torch.float32` and then casting to the needed dtype. `randn_tensor` with `torch.float32` produces a completely different tensor than `randn_tensor` with `torch.float16`, so when reproducibility is a concern, produce random tensors in `float32` and cast afterwards.
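A runnable sketch of that dtype sensitivity; the expected result follows the observation above:

```python
import torch
from diffusers.utils.torch_utils import randn_tensor

shape = (1, 4, 8, 8)

generator = torch.Generator("cpu").manual_seed(0)
fp16 = randn_tensor(shape, generator=generator, device=torch.device("cpu"), dtype=torch.float16)

generator = torch.Generator("cpu").manual_seed(0)
fp32 = randn_tensor(shape, generator=generator, device=torch.device("cpu"), dtype=torch.float32)

# Same seed, different dtypes: per the observation above, the draws are not
# simply casts of each other, so fp16 and fp32 runs diverge from step 0
# unless latents are drawn in float32 and cast afterwards.
print(torch.allclose(fp16.float(), fp32, atol=1e-2))  # False
```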
## Notes

The slow-test expected slice may have changed due to the `prepare_latents` change.