
Commit 6156cf8

yiyixuxu authored
Hunyuanvideo15 (#12696)
* add

---------

Co-authored-by: yiyi@huggingface.co <yiyi@ip-26-0-161-123.ec2.internal>
Co-authored-by: yiyi@huggingface.co <yiyi@ip-26-0-160-103.ec2.internal>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 152f7ca commit 6156cf8

23 files changed, +5170 -0 lines changed

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
@@ -359,6 +359,8 @@
       title: HunyuanDiT2DModel
     - local: api/models/hunyuanimage_transformer_2d
       title: HunyuanImageTransformer2DModel
+    - local: api/models/hunyuan_video15_transformer_3d
+      title: HunyuanVideo15Transformer3DModel
     - local: api/models/hunyuan_video_transformer_3d
       title: HunyuanVideoTransformer3DModel
     - local: api/models/latte_transformer3d
@@ -433,6 +435,8 @@
       title: AutoencoderKLHunyuanImageRefiner
     - local: api/models/autoencoder_kl_hunyuan_video
       title: AutoencoderKLHunyuanVideo
+    - local: api/models/autoencoder_kl_hunyuan_video15
+      title: AutoencoderKLHunyuanVideo15
     - local: api/models/autoencoderkl_ltx_video
       title: AutoencoderKLLTXVideo
     - local: api/models/autoencoderkl_magvit
@@ -652,6 +656,8 @@
       title: Framepack
     - local: api/pipelines/hunyuan_video
       title: HunyuanVideo
+    - local: api/pipelines/hunyuan_video15
+      title: HunyuanVideo1.5
     - local: api/pipelines/i2vgenxl
       title: I2VGen-XL
     - local: api/pipelines/kandinsky5_video
docs/source/en/api/models/autoencoder_kl_hunyuan_video15.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLHunyuanVideo15

The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo1.5](https://github.com/Tencent/HunyuanVideo1-1.5) by Tencent.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLHunyuanVideo15

vae = AutoencoderKLHunyuanVideo15.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="vae", torch_dtype=torch.float32
)

# make sure to enable tiling to avoid OOM
vae.enable_tiling()
```
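
To run the VAE on its own, you can round-trip a video tensor through `encode` and `decode`. The sketch below assumes the standard Diffusers KL-VAE interface (`encode(...).latent_dist` and `decode(...).sample`) and uses an illustrative input shape; replace the dummy tensor with your actual pixel-space video.

```python
import torch
from diffusers import AutoencoderKLHunyuanVideo15

vae = AutoencoderKLHunyuanVideo15.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="vae", torch_dtype=torch.float32
)
vae.enable_tiling()

# illustrative shape: (batch, channels, num_frames, height, width)
video = torch.randn(1, 3, 9, 64, 64, dtype=torch.float32)

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # pixel video -> latent video
    reconstruction = vae.decode(latents).sample       # latent video -> pixel video

print(latents.shape, reconstruction.shape)
```
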
## AutoencoderKLHunyuanVideo15

[[autodoc]] AutoencoderKLHunyuanVideo15
  - decode
  - encode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
docs/source/en/api/models/hunyuan_video15_transformer_3d.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# HunyuanVideo15Transformer3DModel

A Diffusion Transformer model for 3D video-like data used in [HunyuanVideo1.5](https://github.com/Tencent/HunyuanVideo1-1.5).

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import HunyuanVideo15Transformer3DModel

transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="transformer", torch_dtype=torch.bfloat16
)
```
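
If you want to customize how the transformer is loaded (for example, to pick its dtype independently), you can load it separately and pass it to the pipeline. The sketch below uses the standard Diffusers pattern of overriding a pipeline component in `from_pretrained` and simply reuses the checkpoint from the snippet above.

```python
import torch
from diffusers import HunyuanVideo15Pipeline, HunyuanVideo15Transformer3DModel

# load the transformer on its own, e.g. to control its dtype independently
transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="transformer", torch_dtype=torch.bfloat16
)

# hand the preloaded module to the pipeline instead of letting it load its own copy
pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
```
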
## HunyuanVideo15Transformer3DModel

[[autodoc]] HunyuanVideo15Transformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
docs/source/en/api/pipelines/hunyuan_video15.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# HunyuanVideo-1.5

HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. It builds on several key components: meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Together, these form a unified framework for high-quality text-to-video and image-to-video generation across multiple durations and resolutions, and the compact model establishes a new state of the art among open-source models.

You can find all the original HunyuanVideo-1.5 checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> Click on the HunyuanVideo-1.5 models in the right sidebar for more examples of video generation tasks.
>
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

The example below demonstrates how to generate a video optimized for memory or inference speed.
<hfoptions id="usage">
<hfoption id="memory">

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",
    torch_dtype=torch.bfloat16,
)

# model offloading and VAE tiling reduce peak memory
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
</hfoptions>
## Notes
- HunyuanVideo1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently:

  - **H100/H800:** `_flash_3_hub` or `_flash_varlen_3`
  - **A100/A800/RTX 4090:** `flash_hub` or `flash_varlen`
  - **Other GPUs:** `sage_hub`

  Refer to the [Attention backends](../../optimization/attention_backends) guide for more details about using a different backend.

  ```py
  pipe.transformer.set_attention_backend("flash_hub")  # or your preferred backend
  ```

- [`HunyuanVideo15Pipeline`] uses a guider and does not take a `guidance_scale` parameter at runtime.

  You can check the default guider configuration with `pipe.guider`:

  ```py
  >>> pipe.guider
  ClassifierFreeGuidance {
    "_class_name": "ClassifierFreeGuidance",
    "_diffusers_version": "0.36.0.dev0",
    "enabled": true,
    "guidance_rescale": 0.0,
    "guidance_scale": 6.0,
    "start": 0.0,
    "stop": 1.0,
    "use_original_formulation": false
  }

  State:
    step: None
    num_inference_steps: None
    timestep: None
    count_prepared: 0
    enabled: True
    num_conditions: 2
  ```

  To update the guider configuration, run `pipe.guider = pipe.guider.new(...)`:

  ```py
  pipe.guider = pipe.guider.new(guidance_scale=5.0)
  ```
  Read more about guiders [here](../../modular_diffusers/guiders).
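
- For image-to-video generation, use [`HunyuanVideo15ImageToVideoPipeline`], which accepts a conditioning image alongside the prompt. The snippet below is a minimal sketch: the `480p_i2v` repository name and the input image URL are assumptions used for illustration, so substitute the image-to-video checkpoint and image you actually want to use.

  ```py
  import torch
  from diffusers import HunyuanVideo15ImageToVideoPipeline
  from diffusers.utils import export_to_video, load_image

  # assumed image-to-video checkpoint name; replace with the repo you intend to use
  pipeline = HunyuanVideo15ImageToVideoPipeline.from_pretrained(
      "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v",
      torch_dtype=torch.bfloat16,
  )
  pipeline.enable_model_cpu_offload()
  pipeline.vae.enable_tiling()

  # any RGB image works here; this URL is only a placeholder
  image = load_image("https://example.com/first_frame.png")
  prompt = "The teddy bear slowly waves at the camera while toys drift around it."

  video = pipeline(image=image, prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
  export_to_video(video, "output_i2v.mp4", fps=15)
  ```
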
## HunyuanVideo15Pipeline

[[autodoc]] HunyuanVideo15Pipeline
  - all
  - __call__

## HunyuanVideo15ImageToVideoPipeline

[[autodoc]] HunyuanVideo15ImageToVideoPipeline
  - all
  - __call__

## HunyuanVideo15PipelineOutput

[[autodoc]] pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput
