# feat: Add Motif-Video model and pipelines #13551
Open: waitingcheung wants to merge 88 commits into `huggingface:main` from `waitingcheung:feat/motif-video`.
## Changes from all commits (88)
- `81cce23` feat: add Motif Video T2V and I2V pipelines with AdaptiveProjectedGui…
- `44045b2` Merge branch 'main' into feat/motif-video
- waitingcheung `127810b` Merge branch 'main' into feat/motif-video
- waitingcheung `c2f1a14` Merge branch 'main' into feat/motif-video
- waitingcheung `e3230cc` Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
- `de3c7ff` Remove linear quadratic
- `9b0fe33` Remove musicldm
- `27b646a` Update docstring
- `fb74918` Address vision_encoder comment
- `f27c663` Add copy source in I2V pipeline
- `5f1c157` Refactor _get_prompt_embeds
- `80aa88a` Fix a typo
- `e47f893` Refactor MotifVideo transformer to use diffusers Attention conventions
- waitingcheung `ce30711` Use base classes for scheduler and guider
- waitingcheung `005bec4` Implement MotifVideoAttention
- waitingcheung `070ff88` Update style and quality
- waitingcheung `f118214` Fix a typo
- waitingcheung `cd8e91f` Fix a typo
- waitingcheung `112761b` Fix a typo
- waitingcheung `c3d7ca1` Update year
- waitingcheung `3bc4f31` Address rope dtype
- waitingcheung `5a7bdff` Update docstring and remove frame_rate
- waitingcheung `3ef2018` Address unused sigmas
- waitingcheung `860634c` Add available processors
- waitingcheung `d3069c6` Address copy from comment
- waitingcheung `35f26c8` Remove torch.no_grad()
- waitingcheung `6a9ca5d` Remove use_attention_mask
- waitingcheung `ed4b717` Merge branch 'main' into feat/motif-video
- waitingcheung `033b3bd` Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
- waitingcheung `26ff14c` Address inline cross-attention
- waitingcheung `553e033` Address compute dtype
- waitingcheung `b44d940` Remove unused variables
- waitingcheung `d5f2d70` Merge branch 'main' into feat/motif-video
- waitingcheung `8ec5a81` Merge branch 'main' into feat/motif-video
- waitingcheung `a4a451e` Merge main APG into this branch and update documentation
- waitingcheung `b0dcc21` Refactor cross attention processor
- waitingcheung `c942e24` Merge branch 'main' into feat/motif-video
- waitingcheung `7538471` Merge branch 'main' into feat/motif-video
- waitingcheung `531c4b6` Merge branch 'main' into feat/motif-video
- waitingcheung `9fc6616` Merge branch 'main' into feat/motif-video
- waitingcheung `3b1276c` Remove unused timestep
- waitingcheung `0666f97` Inline create_attention_mask
- waitingcheung `f39cdce` Make guider required
- waitingcheung `f647cb6` Address encode_prompt comment
- waitingcheung `488aaf5` Address preprocess_video comment
- waitingcheung `06c0604` Use T5Gemma2Encoder in test cases
- waitingcheung `4a7e229` Address None feature_extractor
- waitingcheung `841ae87` Address output type
- waitingcheung `ef1c21d` Re-enable skipped tests
- waitingcheung `e58e1b0` Update style and quality
- waitingcheung `d6344e7` Generate standard transformer test case
- waitingcheung `4db06cc` Add model test case
- waitingcheung `a825a24` Remove guider in documentation
- waitingcheung `9e1b353` Implement cross_attn layer
- waitingcheung `684e9d4` Remove prepare_negative_prompt
- waitingcheung `a63f669` Address latent is None
- waitingcheung `4945239` Clean up feature_extractor
- waitingcheung `58c64d6` Fix prepare_latents
- waitingcheung `b832444` Remove transformers assertion
- waitingcheung `f02880a` Fix style and quality
- waitingcheung `b02ff93` Merge branch 'main' into feat/motif-video
- waitingcheung `bad1abb` Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
- waitingcheung `2ecf212` Merge branch 'main' into feat/motif-video
- waitingcheung `76f5c65` Merge branch 'main' into feat/motif-video
- dg845 `b324cee` Fix python utils/check_copies.py --fix_and_overwrite
- waitingcheung `1931ab1` Add dropout rate to text config
- waitingcheung `e355768` Skip tests requiring guidance_scale
- waitingcheung `37fdd69` Merge branch 'main' into feat/motif-video
- waitingcheung `3907283` Fix encode_prompt in test cases
- waitingcheung `c5a3ffe` Merge branch 'main' into feat/motif-video
- waitingcheung `b0dbfad` Fix test_cpu_offload_forward_pass_twice
- waitingcheung `576c22e` Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
- waitingcheung `9de497f` Merge branch 'main' into feat/motif-video
- waitingcheung `175a05d` Merge branch 'main' into feat/motif-video
- waitingcheung `ebed9ac` Merge branch 'main' into feat/motif-video
- waitingcheung `ab4273e` Merge branch 'main' into feat/motif-video
- waitingcheung `3ee4218` Update tests/pipelines/motif_video/test_motif_video.py
- waitingcheung `754a547` Update tests/pipelines/motif_video/test_motif_video.py
- waitingcheung `fe938cb` Update tests/pipelines/motif_video/test_motif_video.py
- waitingcheung `0d11cc4` Update tests/pipelines/motif_video/test_motif_video_image2video.py
- waitingcheung `0fcbd64` Address test_attention_slicing_forward_pass comment
- waitingcheung `f3325dd` Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
- waitingcheung `b33ec8e` Update tests/pipelines/motif_video/test_motif_video_image2video.py
- waitingcheung `65c9dc6` Update tests/pipelines/motif_video/test_motif_video_image2video.py
- waitingcheung `b9de465` Update tests/pipelines/motif_video/test_motif_video_image2video.py
- waitingcheung `9c38b3d` Skip I2V test cases
- waitingcheung `0e89d56` Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
- waitingcheung `f2aab3c` Fix style and quality
@@ -0,0 +1,32 @@

<!-- Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# MotifVideoTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in Motif-Video by the Motif Technologies Team.

The model uses a three-stage architecture with 12 dual-stream, 16 single-stream, and 8 DDT decoder layers, together with rotary positional embeddings (RoPE), for video generation.

The model can be loaded with the following code snippet:

```python
import torch

from diffusers import MotifVideoTransformer3DModel

transformer = MotifVideoTransformer3DModel.from_pretrained(
    "Motif-Technologies/Motif-Video-2B", subfolder="transformer", torch_dtype=torch.bfloat16
)
```

## MotifVideoTransformer3DModel

[[autodoc]] MotifVideoTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
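The transformer documentation above mentions rotary positional embeddings (RoPE). As a hedged illustration of the general mechanism only (not Motif-Video's actual implementation, which would spread frequencies across temporal and spatial axes for video tokens), a minimal 1D RoPE rotation can be sketched as:

```python
import math


def rope_rotate(vec, pos, theta=10000.0):
    """Rotate consecutive (even, odd) channel pairs of `vec` by a
    position-dependent angle, the core operation of RoPE."""
    out = []
    dim = len(vec)
    for i in range(0, dim, 2):
        freq = theta ** (-i / dim)  # lower channel pairs rotate faster
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Position 0 leaves the vector unchanged, and rotation preserves the norm,
# which is why attention scores end up depending on relative positions only.
q = [1.0, 0.0, 0.5, 0.5]
q0 = rope_rotate(q, 0)
q7 = rope_rotate(q, 7)
```

The per-pair rotation is a 2x2 orthogonal matrix, so applying it to queries and keys rotates their dot product by the difference of their positions.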
@@ -0,0 +1,123 @@

<!-- Copyright 2026 The HuggingFace Team. All rights reserved. -->

# Motif-Video

[Technical Report](https://arxiv.org/abs/2604.16503)

Motif-Video is a 2B parameter diffusion transformer designed for text-to-video and image-to-video generation. It features a three-stage architecture with 12 dual-stream, 16 single-stream, and 8 DDT decoder layers, Shared Cross-Attention for stable text-video alignment over long video sequences, a T5Gemma2 text encoder, and rectified flow matching for velocity prediction.

<p align="center">
    <img src="https://huggingface.co/Motif-Technologies/Motif-Video-2B/resolve/main/assets/architecture.png" width="90%" alt="Motif-Video architecture"/>
</p>

## Text-to-Video Generation

Use `MotifVideoPipeline` for text-to-video generation:

```python
import torch

from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```

## Image-to-Video Generation

Use `MotifVideoImage2VideoPipeline` for image-to-video generation:

```python
import torch

from diffusers import MotifVideoImage2VideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = MotifVideoImage2VideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("input_image.png")
prompt = "A cinematic scene with vivid colors."
negative_prompt = "worst quality, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "i2v_output.mp4", fps=24)
```

### Memory-efficient Inference

For GPUs with less than 30GB of VRAM (e.g., an RTX 4090), use model CPU offloading and, optionally, expandable CUDA allocator segments:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

```python
import torch

from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```

## MotifVideoPipeline

[[autodoc]] MotifVideoPipeline
  - all
  - __call__

## MotifVideoImage2VideoPipeline

[[autodoc]] MotifVideoImage2VideoPipeline
  - all
  - __call__

## MotifVideoPipelineOutput

[[autodoc]] pipelines.motif_video.pipeline_output.MotifVideoPipelineOutput
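The pipeline examples above all pass `num_frames=121`. A common pattern in video diffusion models (assumed here for illustration, not confirmed for Motif-Video's VAE specifically) is a causal video VAE that keeps the first frame and compresses each subsequent group of 4 frames into one latent frame, which is why frame counts of the form 4k + 1 (121 = 4 x 30 + 1) show up. A hypothetical helper for that bookkeeping:

```python
def latent_num_frames(num_frames: int, temporal_ratio: int = 4) -> int:
    """Hypothetical helper: map a pixel-space frame count to a latent frame
    count under a causal VAE that keeps the first frame and compresses the
    rest by `temporal_ratio`. Under this assumed scheme, 121 frames -> 31
    latent frames."""
    if (num_frames - 1) % temporal_ratio != 0:
        raise ValueError(f"num_frames should be of the form {temporal_ratio} * k + 1")
    return (num_frames - 1) // temporal_ratio + 1
```

Both the compression ratio and the keep-first-frame convention are assumptions; check the released VAE config for the actual values.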
> @waitingcheung ohh, this change does not seem to be made by our formatter, no?
No. This change was made by us to be compatible with our GGUF checkpoints.
https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF
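For context, diffusers already exposes a pattern for loading GGUF-quantized transformer weights via `GGUFQuantizationConfig`. A hedged sketch of how the linked checkpoint might be used, assuming the PR's `MotifVideoTransformer3DModel` supports `from_single_file` loading (an assumption about this in-review code, and the `.gguf` path below is a placeholder):

```python
import torch

from diffusers import GGUFQuantizationConfig, MotifVideoPipeline, MotifVideoTransformer3DModel

# Placeholder path: pick an actual .gguf file from the
# Motif-Technologies/Motif-Video-2B-GGUF repository.
ckpt_path = "<path or URL to a .gguf file from Motif-Technologies/Motif-Video-2B-GGUF>"

transformer = MotifVideoTransformer3DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```

This mirrors the GGUF loading flow diffusers documents for other transformers; whether the Motif-Video class participates depends on the final state of this PR.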