Description
Thanks for open-sourcing such great work! I have a question about the misalignment between the generated video and the input low-quality video. According to the largest_8n1_leq function, the low-quality video has 8n+1 frames, while the noise input to the diffusion model has 2n latent frames, which correspond to 8n-3 frames of the generated video. Moreover, the first 6 noise latents take the first 25 frames of the low-quality video as condition, and each subsequent latent corresponds to 4 more low-resolution frames, so 8n+1 low-quality frames are needed as input to the diffusion model. Yet when the latents are converted back into video by the TCDecoder, only 8n-3 frames are produced. Why is there this misalignment between the input video and the generated video? A small sketch of my counting is included below.
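To make the mismatch concrete, here is a minimal sketch of the frame-count arithmetic I describe above. The numbers (8n+1, 2n, 8n-3, 25 frames for the first 6 latents, 4 frames per subsequent latent) are taken from my reading of the code; the helper names are mine, not from the repo.

```python
def conditioning_frames(num_latents: int) -> int:
    """LQ frames consumed as condition: 25 for the first 6 latents,
    then 4 per additional latent (as I understand the code)."""
    assert num_latents >= 6
    return 25 + 4 * (num_latents - 6)


def decoded_frames(num_latents: int) -> int:
    """Frames produced when 2n latents are decoded by the TCDecoder,
    i.e. 8n - 3 as stated above."""
    n = num_latents // 2
    return 8 * n - 3


for n in (4, 8, 16):
    latents = 2 * n
    lq_len = 8 * n + 1  # LQ clip length chosen by largest_8n1_leq
    print(
        f"n={n}: LQ input={lq_len}, "
        f"condition needs={conditioning_frames(latents)}, "
        f"decoded output={decoded_frames(latents)}, "
        f"gap={lq_len - decoded_frames(latents)}"
    )
```

For every n this gives a constant gap of 4 frames between the 8n+1 conditioning frames and the 8n-3 decoded frames, which is exactly the misalignment I am asking about.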