Description
Thanks for open-sourcing such great work! I have a question about the misalignment between the generated video and the input low-quality video. According to the largest_8n1_leq function, the low-quality video has 8n+1 frames, while the noise input to the diffusion model has 2n latent frames, which correspond to 8n-3 frames of the generated video. Moreover, the first 6 noise latents take the first 25 frames of the low-quality video as condition, and each subsequent latent corresponds to 4 more low-resolution frames, so 8n+1 low-quality frames are needed as input to the diffusion model. Yet when the latents are converted back into video by the TCDecoder, only 8n-3 frames are produced. Why is there this misalignment between the input video and the generated video? A small sketch of my counting is included below.
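To make the mismatch concrete, here is a minimal sketch of the frame-count arithmetic I describe above. The numbers (8n+1, 2n, 8n-3, 25 frames for the first 6 latents, 4 frames per subsequent latent) are taken from my reading of the code; the helper names are mine, not from the repo.

```python
def conditioning_frames(num_latents: int) -> int:
    """LQ frames consumed as condition: 25 for the first 6 latents,
    then 4 per additional latent (as I understand the code)."""
    assert num_latents >= 6
    return 25 + 4 * (num_latents - 6)


def decoded_frames(num_latents: int) -> int:
    """Frames produced when 2n latents are decoded by the TCDecoder,
    i.e. 8n - 3 as stated above."""
    n = num_latents // 2
    return 8 * n - 3


for n in (4, 8, 16):
    latents = 2 * n
    lq_len = 8 * n + 1  # LQ clip length chosen by largest_8n1_leq
    print(
        f"n={n}: LQ input={lq_len}, "
        f"condition needs={conditioning_frames(latents)}, "
        f"decoded output={decoded_frames(latents)}, "
        f"gap={lq_len - decoded_frames(latents)}"
    )
```

For every n this gives a constant gap of 4 frames between the 8n+1 conditioning frames and the 8n-3 decoded frames, which is exactly the misalignment I am asking about.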