88 commits
81cce23
feat: add Motif Video T2V and I2V pipelines with AdaptiveProjectedGui…
Apr 23, 2026
44045b2
Merge branch 'main' into feat/motif-video
waitingcheung Apr 27, 2026
127810b
Merge branch 'main' into feat/motif-video
waitingcheung Apr 28, 2026
c2f1a14
Merge branch 'main' into feat/motif-video
waitingcheung Apr 28, 2026
e3230cc
Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
Apr 29, 2026
de3c7ff
Remove linear quadratic
Apr 29, 2026
9b0fe33
Remove musicldm
Apr 29, 2026
27b646a
Update docstring
Apr 29, 2026
fb74918
Address vision_encoder comment
Apr 29, 2026
f27c663
Add copy source in I2V pipeline
Apr 29, 2026
5f1c157
Refactor _get_prompt_embeds
Apr 29, 2026
80aa88a
Fix a typo
Apr 29, 2026
e47f893
Refactor MotifVideo transformer to use diffusers Attention conventions
waitingcheung Apr 29, 2026
ce30711
Use base classes for scheduler and guider
waitingcheung Apr 29, 2026
005bec4
Implement MotifVideoAttention
waitingcheung Apr 29, 2026
070ff88
Update style and quality
waitingcheung Apr 29, 2026
f118214
Fix a typo
waitingcheung Apr 29, 2026
cd8e91f
Fix a typo
waitingcheung Apr 29, 2026
112761b
Fix a typo
waitingcheung Apr 29, 2026
c3d7ca1
Update year
waitingcheung Apr 29, 2026
3bc4f31
Address rope dtype
waitingcheung Apr 30, 2026
5a7bdff
Update docstring and remove frame_rate
waitingcheung Apr 30, 2026
3ef2018
Address unused sigmas
waitingcheung Apr 30, 2026
860634c
Add available processors
waitingcheung Apr 30, 2026
d3069c6
Address copy from comment
waitingcheung Apr 30, 2026
35f26c8
Remove torch.no_grad()
waitingcheung Apr 30, 2026
6a9ca5d
Remove use_attention_mask
waitingcheung Apr 30, 2026
ed4b717
Merge branch 'main' into feat/motif-video
waitingcheung Apr 30, 2026
033b3bd
Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
waitingcheung Apr 30, 2026
26ff14c
Address inline cross-attention
waitingcheung Apr 30, 2026
553e033
Address compute dtype
waitingcheung Apr 30, 2026
b44d940
Remove unused variables
waitingcheung Apr 30, 2026
d5f2d70
Merge branch 'main' into feat/motif-video
waitingcheung Apr 30, 2026
8ec5a81
Merge branch 'main' into feat/motif-video
waitingcheung Apr 30, 2026
a4a451e
Merge main APG into this branch and update documentation
waitingcheung May 1, 2026
b0dcc21
Refactor cross attention processor
waitingcheung May 1, 2026
c942e24
Merge branch 'main' into feat/motif-video
waitingcheung May 1, 2026
7538471
Merge branch 'main' into feat/motif-video
waitingcheung May 4, 2026
531c4b6
Merge branch 'main' into feat/motif-video
waitingcheung May 6, 2026
9fc6616
Merge branch 'main' into feat/motif-video
waitingcheung May 6, 2026
3b1276c
Remove unused timestep
waitingcheung May 6, 2026
0666f97
Inline create_attention_mask
waitingcheung May 6, 2026
f39cdce
Make guider required
waitingcheung May 6, 2026
f647cb6
Address encode_prompt comment
waitingcheung May 6, 2026
488aaf5
Address preprocess_video comment
waitingcheung May 6, 2026
06c0604
Use T5Gemma2Encoder in test cases
waitingcheung May 6, 2026
4a7e229
Address None feature_extractor
waitingcheung May 6, 2026
841ae87
Address output type
waitingcheung May 6, 2026
ef1c21d
Re-enable skipped tests
waitingcheung May 6, 2026
e58e1b0
Update style and quality
waitingcheung May 6, 2026
d6344e7
Generate standard transformer test case
waitingcheung May 6, 2026
4db06cc
Add model test case
waitingcheung May 6, 2026
a825a24
Remove guider in documentation
waitingcheung May 6, 2026
9e1b353
Implement cross_attn layer
waitingcheung May 6, 2026
684e9d4
Remove prepare_negative_prompt
waitingcheung May 6, 2026
a63f669
Address latent is None
waitingcheung May 6, 2026
4945239
Clean up feature_extractor
waitingcheung May 6, 2026
58c64d6
Fix prepare_latents
waitingcheung May 6, 2026
b832444
Remove transformers assertion
waitingcheung May 6, 2026
f02880a
Fix style and quality
waitingcheung May 6, 2026
b02ff93
Merge branch 'main' into feat/motif-video
waitingcheung May 6, 2026
bad1abb
Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
waitingcheung May 6, 2026
2ecf212
Merge branch 'main' into feat/motif-video
waitingcheung May 7, 2026
76f5c65
Merge branch 'main' into feat/motif-video
dg845 May 7, 2026
b324cee
Fix python utils/check_copies.py --fix_and_overwrite
waitingcheung May 7, 2026
1931ab1
Add dropout rate to text config
waitingcheung May 7, 2026
e355768
Skip tests requiring guidance_scale
waitingcheung May 7, 2026
37fdd69
Merge branch 'main' into feat/motif-video
waitingcheung May 7, 2026
3907283
Fix encode_prompt in test cases
waitingcheung May 7, 2026
c5a3ffe
Merge branch 'main' into feat/motif-video
waitingcheung May 7, 2026
b0dbfad
Fix test_cpu_offload_forward_pass_twice
waitingcheung May 7, 2026
576c22e
Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
waitingcheung May 7, 2026
9de497f
Merge branch 'main' into feat/motif-video
waitingcheung May 8, 2026
175a05d
Merge branch 'main' into feat/motif-video
waitingcheung May 8, 2026
ebed9ac
Merge branch 'main' into feat/motif-video
waitingcheung May 8, 2026
ab4273e
Merge branch 'main' into feat/motif-video
waitingcheung May 8, 2026
3ee4218
Update tests/pipelines/motif_video/test_motif_video.py
waitingcheung May 8, 2026
754a547
Update tests/pipelines/motif_video/test_motif_video.py
waitingcheung May 8, 2026
fe938cb
Update tests/pipelines/motif_video/test_motif_video.py
waitingcheung May 8, 2026
0d11cc4
Update tests/pipelines/motif_video/test_motif_video_image2video.py
waitingcheung May 8, 2026
0fcbd64
Address test_attention_slicing_forward_pass comment
waitingcheung May 8, 2026
f3325dd
Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
waitingcheung May 8, 2026
b33ec8e
Update tests/pipelines/motif_video/test_motif_video_image2video.py
waitingcheung May 8, 2026
65c9dc6
Update tests/pipelines/motif_video/test_motif_video_image2video.py
waitingcheung May 8, 2026
b9de465
Update tests/pipelines/motif_video/test_motif_video_image2video.py
waitingcheung May 8, 2026
9c38b3d
Skip I2V test cases
waitingcheung May 8, 2026
0e89d56
Merge branch 'feat/motif-video' of github.com:waitingcheung/diffusers…
waitingcheung May 8, 2026
f2aab3c
Fix style and quality
waitingcheung May 8, 2026
32 changes: 32 additions & 0 deletions docs/source/en/api/models/motif_video_transformer_3d.md
@@ -0,0 +1,32 @@
<!-- Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# MotifVideoTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in Motif-Video by the Motif Technologies Team.

The model uses a three-stage architecture (12 dual-stream, 16 single-stream, and 8 DDT decoder layers) with rotary positional embeddings (RoPE) for video generation.
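
For intuition, the snippet below is a minimal, generic sketch of RoPE as commonly used in diffusion transformers; the exact frequency schedule and axis factorization used by Motif-Video may differ.

```python
import torch


def rope_angles(positions: torch.Tensor, dim: int, theta: float = 10000.0):
    # One rotation frequency per (even, odd) channel pair.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions.float()[:, None] * freqs[None, :]  # (seq_len, dim // 2)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # Rotate each channel pair of the queries/keys by its position-dependent angle.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)
```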

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import MotifVideoTransformer3DModel

transformer = MotifVideoTransformer3DModel.from_pretrained("Motif-Technologies/Motif-Video-2B", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## MotifVideoTransformer3DModel

[[autodoc]] MotifVideoTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
123 changes: 123 additions & 0 deletions docs/source/en/api/pipelines/motif_video.md
@@ -0,0 +1,123 @@
<!-- Copyright 2026 The HuggingFace Team. All rights reserved. -->

# Motif-Video

[Technical Report](https://arxiv.org/abs/2604.16503)

Motif-Video is a 2B-parameter diffusion transformer for text-to-video and image-to-video generation. It features a three-stage architecture (12 dual-stream, 16 single-stream, and 8 DDT decoder layers), Shared Cross-Attention for stable text-video alignment over long video sequences, a T5Gemma2 text encoder, and rectified flow matching for velocity prediction.

<p align="center">
<img src="https://huggingface.co/Motif-Technologies/Motif-Video-2B/resolve/main/assets/architecture.png" width="90%" alt="Motif-Video architecture"/>
</p>
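
Rectified flow matching trains the transformer to predict the constant velocity along a straight path between clean latents and noise. The sketch below shows how the interpolant and velocity target are constructed in general; it illustrates the objective only and is not Motif-Video's actual training code.

```python
import torch


def rectified_flow_target(x0: torch.Tensor):
    # x0: clean video latents of shape (batch, ...).
    x1 = torch.randn_like(x0)  # noise endpoint
    t = torch.rand(x0.shape[0], dtype=x0.dtype, device=x0.device)
    t = t.view(-1, *([1] * (x0.ndim - 1)))  # broadcast over all non-batch dims
    xt = (1.0 - t) * x0 + t * x1  # straight-line interpolation between data and noise
    v_target = x1 - x0  # constant velocity the network is trained to predict
    return xt, t, v_target
```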

## Text-to-Video Generation

Use `MotifVideoPipeline` for text-to-video generation:

```python
import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video


pipe = MotifVideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=1280,
height=736,
num_frames=121,
num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```
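
As with other Diffusers pipelines, generation can be made reproducible by passing a `torch.Generator`; this assumes `MotifVideoPipeline.__call__` follows the standard `generator` convention:

```python
generator = torch.Generator(device="cuda").manual_seed(42)
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    generator=generator,
).frames[0]
```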

## Image-to-Video Generation

Use `MotifVideoImage2VideoPipeline` for image-to-video generation:

```python
import torch
from diffusers import MotifVideoImage2VideoPipeline
from diffusers.utils import export_to_video, load_image


pipe = MotifVideoImage2VideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("input_image.png")
prompt = "A cinematic scene with vivid colors."
negative_prompt = "worst quality, blurry, jittery, distorted"

video = pipe(
image=image,
prompt=prompt,
negative_prompt=negative_prompt,
width=1280,
height=736,
num_frames=121,
num_inference_steps=50,
).frames[0]
export_to_video(video, "i2v_output.mp4", fps=24)
```
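
The conditioning image should match the requested resolution; resizing it up front keeps the change of framing under your control rather than relying on any internal preprocessing. A small sketch with PIL (`load_image` returns a `PIL.Image`):

```python
from diffusers.utils import load_image

# PIL's resize takes (width, height); match the pipeline's width/height arguments.
image = load_image("input_image.png").resize((1280, 736))
```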

### Memory-efficient Inference

For GPUs with less than 30 GB of VRAM (e.g., an RTX 4090), use model CPU offloading, and let the CUDA allocator use expandable segments to reduce fragmentation:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

```python
import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video


pipe = MotifVideoPipeline.from_pretrained(
"Motif-Technologies/Motif-Video-2B",
torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=1280,
height=736,
num_frames=121,
num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```
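
If model CPU offloading still exceeds the available VRAM, sequential CPU offloading (a standard `DiffusionPipeline` feature) trades considerably slower inference for a lower memory peak; how much it saves with this pipeline has not been measured here:

```python
# Offload weights to the accelerator layer by layer instead of model by model (much slower).
pipe.enable_sequential_cpu_offload()
```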

## MotifVideoPipeline

[[autodoc]] MotifVideoPipeline
- all
- __call__

## MotifVideoImage2VideoPipeline

[[autodoc]] MotifVideoImage2VideoPipeline
- all
- __call__

## MotifVideoPipelineOutput

[[autodoc]] pipelines.motif_video.pipeline_output.MotifVideoPipelineOutput
8 changes: 8 additions & 0 deletions src/diffusers/__init__.py
@@ -265,6 +265,7 @@
"LuminaNextDiT2DModel",
"MochiTransformer3DModel",
"ModelMixin",
"MotifVideoTransformer3DModel",
"MotionAdapter",
"MultiAdapter",
"MultiControlNetModel",
@@ -637,6 +638,9 @@
"MarigoldIntrinsicsPipeline",
"MarigoldNormalsPipeline",
"MochiPipeline",
"MotifVideoImage2VideoPipeline",
"MotifVideoPipeline",
"MotifVideoPipelineOutput",
"MusicLDMPipeline",
"NucleusMoEImagePipeline",
"OmniGenPipeline",
@@ -1087,6 +1091,7 @@
LuminaNextDiT2DModel,
MochiTransformer3DModel,
ModelMixin,
MotifVideoTransformer3DModel,
MotionAdapter,
MultiAdapter,
MultiControlNetModel,
@@ -1434,6 +1439,9 @@
MarigoldIntrinsicsPipeline,
MarigoldNormalsPipeline,
MochiPipeline,
MotifVideoImage2VideoPipeline,
MotifVideoPipeline,
MotifVideoPipelineOutput,
MusicLDMPipeline,
NucleusMoEImagePipeline,
OmniGenPipeline,
20 changes: 20 additions & 0 deletions src/diffusers/hooks/_helpers.py
@@ -188,6 +188,10 @@ def _register_transformer_blocks_metadata():
from ..models.transformers.transformer_kandinsky import Kandinsky5TransformerDecoderBlock
from ..models.transformers.transformer_ltx import LTXVideoTransformerBlock
from ..models.transformers.transformer_mochi import MochiTransformerBlock
from ..models.transformers.transformer_motif_video import (
MotifVideoSingleTransformerBlock,
MotifVideoTransformerBlock,
)
from ..models.transformers.transformer_qwenimage import QwenImageTransformerBlock
from ..models.transformers.transformer_wan import WanTransformerBlock
from ..models.transformers.transformer_z_image import ZImageTransformerBlock
@@ -290,6 +294,22 @@
),
)

# MotifVideo
TransformerBlockRegistry.register(
model_class=MotifVideoTransformerBlock,
metadata=TransformerBlockMetadata(
return_hidden_states_index=0,
return_encoder_hidden_states_index=1,
),
)
TransformerBlockRegistry.register(
model_class=MotifVideoSingleTransformerBlock,
metadata=TransformerBlockMetadata(
return_hidden_states_index=0,
return_encoder_hidden_states_index=1,
),
)

# Wan
TransformerBlockRegistry.register(
model_class=WanTransformerBlock,
24 changes: 20 additions & 4 deletions src/diffusers/loaders/single_file_model.py
@@ -21,7 +21,11 @@
from typing_extensions import Self

from .. import __version__
from ..models.model_loading_utils import _caching_allocator_warmup, _determine_device_map, _expand_device_map
from ..models.model_loading_utils import (
_caching_allocator_warmup,
_determine_device_map,
_expand_device_map,
)
from ..quantizers import DiffusersAutoQuantizer
from ..utils import deprecate, is_accelerate_available, is_torch_version, logging
from ..utils.torch_utils import empty_device_cache
@@ -194,6 +198,10 @@
"checkpoint_mapping_fn": convert_ltx2_audio_vae_to_diffusers,
"default_subfolder": "audio_vae",
},
"MotifVideoTransformer3DModel": {
**Collaborator:** @waitingcheung ohh, this change does not seem to be made by our formatter, no?

**waitingcheung (Author):** No. This change is made by us to be compatible with our GGUF checkpoints.
https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF

"checkpoint_mapping_fn": lambda checkpoint, **kwargs: checkpoint,
"default_subfolder": "transformer",
},
}
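
Registering the class in this mapping is what enables `from_single_file` for it. A hedged sketch of how the GGUF checkpoints mentioned above could then be loaded (`GGUFQuantizationConfig` is existing Diffusers API; the `.gguf` filename below is hypothetical):

```python
import torch

from diffusers import GGUFQuantizationConfig, MotifVideoTransformer3DModel

transformer = MotifVideoTransformer3DModel.from_single_file(
    # Hypothetical filename inside the GGUF repository linked above.
    "https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF/blob/main/motif-video-2b-Q4_K_M.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```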


@@ -336,7 +344,11 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: str | None = None
disable_mmap = kwargs.pop("disable_mmap", False)
device_map = kwargs.pop("device_map", None)

user_agent = {"diffusers": __version__, "file_type": "single_file", "framework": "pytorch"}
user_agent = {
"diffusers": __version__,
"file_type": "single_file",
"framework": "pytorch",
}
# In order to ensure popular quantization methods are supported. Can be disabled with `disable_telemetry`
if quantization_config is not None:
user_agent["quant"] = quantization_config.quant_method.value
@@ -393,7 +405,9 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: str | None = None

config_mapping_kwargs = _get_mapping_function_kwargs(config_mapping_fn, **kwargs)
diffusers_model_config = config_mapping_fn(
original_config=original_config, checkpoint=checkpoint, **config_mapping_kwargs
original_config=original_config,
checkpoint=checkpoint,
**config_mapping_kwargs,
)
else:
if config is not None:
@@ -465,7 +479,9 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: str | None = None

if _should_convert_state_dict_to_diffusers(model_state_dict, checkpoint):
diffusers_format_checkpoint = checkpoint_mapping_fn(
config=diffusers_model_config, checkpoint=checkpoint, **checkpoint_mapping_kwargs
config=diffusers_model_config,
checkpoint=checkpoint,
**checkpoint_mapping_kwargs,
)
else:
diffusers_format_checkpoint = checkpoint
2 changes: 2 additions & 0 deletions src/diffusers/models/__init__.py
@@ -123,6 +123,7 @@
_import_structure["transformers.transformer_ltx2"] = ["LTX2VideoTransformer3DModel"]
_import_structure["transformers.transformer_lumina2"] = ["Lumina2Transformer2DModel"]
_import_structure["transformers.transformer_mochi"] = ["MochiTransformer3DModel"]
_import_structure["transformers.transformer_motif_video"] = ["MotifVideoTransformer3DModel"]
_import_structure["transformers.transformer_nucleusmoe_image"] = ["NucleusMoEImageTransformer2DModel"]
_import_structure["transformers.transformer_omnigen"] = ["OmniGenTransformer2DModel"]
_import_structure["transformers.transformer_ovis_image"] = ["OvisImageTransformer2DModel"]
@@ -249,6 +250,7 @@
Lumina2Transformer2DModel,
LuminaNextDiT2DModel,
MochiTransformer3DModel,
MotifVideoTransformer3DModel,
NucleusMoEImageTransformer2DModel,
OmniGenTransformer2DModel,
OvisImageTransformer2DModel,
1 change: 1 addition & 0 deletions src/diffusers/models/transformers/__init__.py
@@ -44,6 +44,7 @@
from .transformer_ltx2 import LTX2VideoTransformer3DModel
from .transformer_lumina2 import Lumina2Transformer2DModel
from .transformer_mochi import MochiTransformer3DModel
from .transformer_motif_video import MotifVideoTransformer3DModel
from .transformer_nucleusmoe_image import NucleusMoEImageTransformer2DModel
from .transformer_omnigen import OmniGenTransformer2DModel
from .transformer_ovis_image import OvisImageTransformer2DModel