
Commit 065d982

Commit message: up
Parent: c677ee7

File tree

• docs/source/en/api/pipelines/wan.md

1 file changed: +129, -129 lines changed

docs/source/en/api/pipelines/wan.md

Lines changed: 129 additions & 129 deletions
@@ -40,8 +40,8 @@ The following Wan models are supported in Diffusers:
 - [Wan 2.2 T2V 14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
 - [Wan 2.2 I2V 14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers)
 - [Wan 2.2 TI2V 5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers)
-- [Wan 2.2 S2V 14B](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B-Diffusers)
 - [Wan 2.2 Animate 14B](https://huggingface.co/Wan-AI/Wan2.2-Animate-14B-Diffusers)
+- [Wan 2.2 S2V 14B](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B-Diffusers)
 
 > [!TIP]
 > Click on the Wan models in the right sidebar for more examples of video generation.
@@ -239,128 +239,6 @@ export_to_video(output, "output.mp4", fps=16)
 </hfoptions>
 
 
-### Wan-S2V: Audio-Driven Cinematic Video Generation
-
-[Wan-S2V](https://huggingface.co/papers/2508.18621) by the Wan Team.
-
-*Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.*
-
-Project page: https://humanaigc.github.io/wan-s2v-webpage/
-
-This model was contributed by [M. Tolga Cangöz](https://github.com/tolgacangoz).
-
-The example below demonstrates how to use the speech-to-video pipeline to generate a video from a text description, a starting frame, an audio clip, and, optionally, a pose video.
-
-<hfoptions id="S2V usage">
-<hfoption id="usage">
-
-```python
-import numpy as np, math
-import torch
-from diffusers import AutoencoderKLWan, WanSpeechToVideoPipeline
-from diffusers.utils import export_to_merged_video_audio, load_image, load_audio, load_video, export_to_video
-from transformers import Wav2Vec2ForCTC
-import requests
-from PIL import Image
-from io import BytesIO
-
-
-model_id = "Wan-AI/Wan2.2-S2V-14B-Diffusers"
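-# Keep the VAE and audio encoder in float32 for precision; the remaining pipeline components run in bfloat16.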
-audio_encoder = Wav2Vec2ForCTC.from_pretrained(model_id, subfolder="audio_encoder", dtype=torch.float32)
-vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
-pipe = WanSpeechToVideoPipeline.from_pretrained(
-    model_id, vae=vae, audio_encoder=audio_encoder, torch_dtype=torch.bfloat16
-)
-pipe.to("cuda")
-
-headers = {"User-Agent": "Mozilla/5.0"}
-url = "https://upload.wikimedia.org/wikipedia/commons/4/46/Albert_Einstein_sticks_his_tongue.jpg"
-resp = requests.get(url, headers=headers, timeout=30)
-image = Image.open(BytesIO(resp.content))
-
-audio, sampling_rate = load_audio("https://github.com/Wan-Video/Wan2.2/raw/refs/heads/main/examples/Five%20Hundred%20Miles.MP3")
-#pose_video_path_or_url = "https://github.com/Wan-Video/Wan2.2/raw/refs/heads/main/examples/pose.mp4"
-
-def get_size_less_than_area(height,
-                            width,
-                            target_area=1024 * 704,
-                            divisor=64):
-    if height * width <= target_area:
-        # If the original image area is already less than or equal to the target,
-        # no resizing is needed—just padding. Still need to ensure that the padded area doesn't exceed the target.
-        max_upper_area = target_area
-        min_scale = 0.1
-        max_scale = 1.0
-    else:
-        # Resize to fit within the target area and then pad to multiples of `divisor`
-        max_upper_area = target_area  # Maximum allowed total pixel count after padding
-        d = divisor - 1
-        b = d * (height + width)
-        a = height * width
-        c = d**2 - max_upper_area
-
-        # Calculate scale boundaries using quadratic equation
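-        # Padding adds at most d = divisor - 1 pixels to each scaled dimension, so the padded area is at most
-        # (s*h + d) * (s*w + d) = a*s**2 + b*s + d**2. min_scale below is a conservative scale at which even this
-        # worst-case padded area stays within max_upper_area; max_scale ignores padding entirely.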
-        min_scale = (-b + math.sqrt(b**2 - 2 * a * c)) / (2 * a)  # Scale when maximum padding is applied
-        max_scale = math.sqrt(max_upper_area / (height * width))  # Scale without any padding
-
-    # We want to choose the largest possible scale such that the final padded area does not exceed max_upper_area
-    # Use binary search-like iteration to find this scale
-    find_it = False
-    for i in range(100):
-        scale = max_scale - (max_scale - min_scale) * i / 100
-        new_height, new_width = int(height * scale), int(width * scale)
-
-        # Pad to make dimensions divisible by 64
-        pad_height = (64 - new_height % 64) % 64
-        pad_width = (64 - new_width % 64) % 64
-        pad_top = pad_height // 2
-        pad_bottom = pad_height - pad_top
-        pad_left = pad_width // 2
-        pad_right = pad_width - pad_left
-
-        padded_height, padded_width = new_height + pad_height, new_width + pad_width
-
-        if padded_height * padded_width <= max_upper_area:
-            find_it = True
-            break
-
-    if find_it:
-        return padded_height, padded_width
-    else:
-        # Fallback: calculate target dimensions based on aspect ratio and divisor alignment
-        aspect_ratio = width / height
-        target_width = int(
-            (target_area * aspect_ratio)**0.5 // divisor * divisor)
-        target_height = int(
-            (target_area / aspect_ratio)**0.5 // divisor * divisor)
-
-        # Ensure the result is not larger than the original resolution
-        if target_width >= width or target_height >= height:
-            target_width = int(width // divisor * divisor)
-            target_height = int(height // divisor * divisor)
-
-        return target_height, target_width
-
-height, width = get_size_less_than_area(image.height, image.width, 480*832)
-
-prompt = "Einstein singing a song."
-
-output = pipe(
-    prompt=prompt, image=image, audio=audio, sampling_rate=sampling_rate,
-    height=height, width=width, num_frames_per_chunk=80,
-    #pose_video_path_or_url=pose_video_path_or_url,
-).frames[0]
-export_to_video(output, "output.mp4", fps=16)
-
-# Finally, merge the generated video and the audio into a single video (duration set to the
-# shorter of the two), overwriting the original video file.
-export_to_merged_video_audio("output.mp4", "audio.mp3")
-```
-
-</hfoption>
-</hfoptions>
-
-
 ### Any-to-Video Controllable Generation
 
 Wan VACE supports various generation techniques which achieve controllable video generation. Some of the capabilities include:
@@ -588,6 +466,128 @@ export_to_video(output, "animated_advanced.mp4", fps=16)
 - **num_frames**: Total number of frames to generate. Should be divisible by `vae_scale_factor_temporal` (default: 4)
 
 
+### Wan-S2V: Audio-Driven Cinematic Video Generation
+
+[Wan-S2V](https://huggingface.co/papers/2508.18621) by the Wan Team.
+
+*Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.*
+
+Project page: https://humanaigc.github.io/wan-s2v-webpage/
+
+This model was contributed by [M. Tolga Cangöz](https://github.com/tolgacangoz).
+
+The example below demonstrates how to use the speech-to-video pipeline to generate a video from a text description, a starting frame, an audio clip, and, optionally, a pose video.
+
+<hfoptions id="S2V usage">
+<hfoption id="usage">
+
+```python
+import numpy as np, math
+import torch
+from diffusers import AutoencoderKLWan, WanSpeechToVideoPipeline
+from diffusers.utils import export_to_merged_video_audio, load_image, load_audio, load_video, export_to_video
+from transformers import Wav2Vec2ForCTC
+import requests
+from PIL import Image
+from io import BytesIO
+
+
+model_id = "Wan-AI/Wan2.2-S2V-14B-Diffusers"
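+# Keep the VAE and audio encoder in float32 for precision; the remaining pipeline components run in bfloat16.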
+audio_encoder = Wav2Vec2ForCTC.from_pretrained(model_id, subfolder="audio_encoder", dtype=torch.float32)
+vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+pipe = WanSpeechToVideoPipeline.from_pretrained(
+    model_id, vae=vae, audio_encoder=audio_encoder, torch_dtype=torch.bfloat16
+)
+pipe.to("cuda")
+
+headers = {"User-Agent": "Mozilla/5.0"}
+url = "https://upload.wikimedia.org/wikipedia/commons/4/46/Albert_Einstein_sticks_his_tongue.jpg"
+resp = requests.get(url, headers=headers, timeout=30)
+image = Image.open(BytesIO(resp.content))
+
+audio, sampling_rate = load_audio("https://github.com/Wan-Video/Wan2.2/raw/refs/heads/main/examples/Five%20Hundred%20Miles.MP3")
+#pose_video_path_or_url = "https://github.com/Wan-Video/Wan2.2/raw/refs/heads/main/examples/pose.mp4"
+
+def get_size_less_than_area(height,
+                            width,
+                            target_area=1024 * 704,
+                            divisor=64):
+    if height * width <= target_area:
+        # If the original image area is already less than or equal to the target,
+        # no resizing is needed—just padding. Still need to ensure that the padded area doesn't exceed the target.
+        max_upper_area = target_area
+        min_scale = 0.1
+        max_scale = 1.0
+    else:
+        # Resize to fit within the target area and then pad to multiples of `divisor`
+        max_upper_area = target_area  # Maximum allowed total pixel count after padding
+        d = divisor - 1
+        b = d * (height + width)
+        a = height * width
+        c = d**2 - max_upper_area
+
+        # Calculate scale boundaries using quadratic equation
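+        # Padding adds at most d = divisor - 1 pixels to each scaled dimension, so the padded area is at most
+        # (s*h + d) * (s*w + d) = a*s**2 + b*s + d**2. min_scale below is a conservative scale at which even this
+        # worst-case padded area stays within max_upper_area; max_scale ignores padding entirely.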
+        min_scale = (-b + math.sqrt(b**2 - 2 * a * c)) / (2 * a)  # Scale when maximum padding is applied
+        max_scale = math.sqrt(max_upper_area / (height * width))  # Scale without any padding
+
+    # We want to choose the largest possible scale such that the final padded area does not exceed max_upper_area
+    # Use binary search-like iteration to find this scale
+    find_it = False
+    for i in range(100):
+        scale = max_scale - (max_scale - min_scale) * i / 100
+        new_height, new_width = int(height * scale), int(width * scale)
+
+        # Pad to make dimensions divisible by 64
+        pad_height = (64 - new_height % 64) % 64
+        pad_width = (64 - new_width % 64) % 64
+        pad_top = pad_height // 2
+        pad_bottom = pad_height - pad_top
+        pad_left = pad_width // 2
+        pad_right = pad_width - pad_left
+
+        padded_height, padded_width = new_height + pad_height, new_width + pad_width
+
+        if padded_height * padded_width <= max_upper_area:
+            find_it = True
+            break
+
+    if find_it:
+        return padded_height, padded_width
+    else:
+        # Fallback: calculate target dimensions based on aspect ratio and divisor alignment
+        aspect_ratio = width / height
+        target_width = int(
+            (target_area * aspect_ratio)**0.5 // divisor * divisor)
+        target_height = int(
+            (target_area / aspect_ratio)**0.5 // divisor * divisor)
+
+        # Ensure the result is not larger than the original resolution
+        if target_width >= width or target_height >= height:
+            target_width = int(width // divisor * divisor)
+            target_height = int(height // divisor * divisor)
+
+        return target_height, target_width
+
+height, width = get_size_less_than_area(image.height, image.width, 480*832)
+
+prompt = "Einstein singing a song."
+
+output = pipe(
+    prompt=prompt, image=image, audio=audio, sampling_rate=sampling_rate,
+    height=height, width=width, num_frames_per_chunk=80,
+    #pose_video_path_or_url=pose_video_path_or_url,
+).frames[0]
+export_to_video(output, "output.mp4", fps=16)
+
+# Finally, merge the generated video and the audio into a single video (duration set to the
+# shorter of the two), overwriting the original video file.
+export_to_merged_video_audio("output.mp4", "audio.mp3")
+```
+
+</hfoption>
+</hfoptions>
+
+
 ## Notes
 
 - Wan2.1 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].
@@ -692,12 +692,6 @@ export_to_video(output, "animated_advanced.mp4", fps=16)
 - all
 - __call__
 
-## WanSpeechToVideoPipeline
-
-[[autodoc]] WanSpeechToVideoPipeline
-- all
-- __call__
-
 ## WanVideoToVideoPipeline
 
 [[autodoc]] WanVideoToVideoPipeline
@@ -710,6 +704,12 @@ export_to_video(output, "animated_advanced.mp4", fps=16)
 - all
 - __call__
 
+## WanSpeechToVideoPipeline
+
+[[autodoc]] WanSpeechToVideoPipeline
+- all
+- __call__
+
 ## WanPipelineOutput
 
 [[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput
