Confusion on UNet Block in SD 2.1
I am confused about the UNet architecture in Stable Diffusion V1.5/V2.1.
In the paper Adding Conditional Control to Text-to-Image Diffusion Models, the authors describe the UNet blocks as follows, claiming that each main block contains 4 ResNet layers and 2 ViTs and that each such block is repeated 3 times:
> We use Stable Diffusion [71] as an example to show how ControlNet can add conditional control to a large pretrained diffusion model. Stable Diffusion is essentially a U-Net [72] with an encoder, a middle block, and a skip-connected decoder. Both the encoder and decoder contain 12 blocks, and the full model contains 25 blocks, including the middle block. Of the 25 blocks, 8 blocks are down-sampling or up-sampling convolution layers, while the other 17 blocks are main blocks that each contain 4 resnet layers and 2 Vision Transformers (ViTs). Each ViT contains several cross-attention and self-attention mechanisms. For example, in Figure 3a, the “SD Encoder Block A” contains 4 resnet layers and 2 ViTs, while the “×3” indicates that this block is repeated three times.
However, when I look at the PyTorch / Hugging Face diffusers implementation, each encoder (down) block appears to contain only 2 ResNet layers and 2 ViTs, each decoder (up) block contains 3 ResNet layers and 3 ViTs, and the blocks are not repeated.
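
To double-check this, the sub-blocks can be counted directly from the diffusers model. The snippet below is only a minimal sketch, assuming the `stabilityai/stable-diffusion-2-1` checkpoint on the Hugging Face Hub and the current `UNet2DConditionModel` layout (diffusers stores the sub-modules in `resnets`/`attentions` module lists; details may vary across diffusers versions):

```python
# Minimal sketch: count ResNet and Transformer ("ViT") sub-blocks per UNet block.
# Assumes the stabilityai/stable-diffusion-2-1 checkpoint and the diffusers layout.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

def count_sub_blocks(block):
    # Sub-modules live in ModuleLists named `resnets` / `attentions`;
    # plain DownBlock2D / UpBlock2D have no `attentions` attribute at all.
    n_resnets = len(getattr(block, "resnets", []))
    n_attentions = len(getattr(block, "attentions", []))
    return n_resnets, n_attentions

for i, block in enumerate(unet.down_blocks):
    print(f"down_blocks.{i} ({type(block).__name__}): {count_sub_blocks(block)}")
print(f"mid_block ({type(unet.mid_block).__name__}): {count_sub_blocks(unet.mid_block)}")
for i, block in enumerate(unet.up_blocks):
    print(f"up_blocks.{i} ({type(block).__name__}): {count_sub_blocks(block)}")
```

On SD 2.1 this reports 2 ResNets (and 2 Transformer blocks in the cross-attention stages) per down block and 3 per up block, which matches the tensor dump in the supplement below.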
Supplement
Below is the full tensor dump of the SD 2.1 UNet (FP16 weights); a sketch for reproducing this dump follows the table.
| Tensors | Shape | Precision |
|---|---|---|
| time_embedding.linear_1.bias | [1 280] | F16 |
| time_embedding.linear_1.weight | [1 280, 320] | F16 |
| time_embedding.linear_2.bias | [1 280] | F16 |
| time_embedding.linear_2.weight | [1 280, 1 280] | F16 |
| conv_in.bias | [320] | F16 |
| conv_in.weight | [320, 4, 3, 3] | F16 |
| conv_norm_out.bias | [320] | F16 |
| conv_norm_out.weight | [320] | F16 |
| conv_out.bias | [4] | F16 |
| conv_out.weight | [4, 320, 3, 3] | F16 |
| down_blocks.0.attentions.0.norm.bias | [320] | F16 |
| down_blocks.0.attentions.0.norm.weight | [320] | F16 |
| down_blocks.0.attentions.0.proj_in.bias | [320] | F16 |
| down_blocks.0.attentions.0.proj_in.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.proj_out.bias | [320] | F16 |
| down_blocks.0.attentions.0.proj_out.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight | [320, 1 024] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight | [320, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight | [320, 1 024] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [2 560] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [2 560, 320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.ff.net.2.bias | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.ff.net.2.weight | [320, 1 280] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.norm1.bias | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.norm1.weight | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.norm2.bias | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.norm2.weight | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.norm3.bias | [320] | F16 |
| down_blocks.0.attentions.0.transformer_blocks.0.norm3.weight | [320] | F16 |
| down_blocks.0.attentions.1.norm.bias | [320] | F16 |
| down_blocks.0.attentions.1.norm.weight | [320] | F16 |
| down_blocks.0.attentions.1.proj_in.bias | [320] | F16 |
| down_blocks.0.attentions.1.proj_in.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.proj_out.bias | [320] | F16 |
| down_blocks.0.attentions.1.proj_out.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_k.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_out.0.bias | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_out.0.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_q.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_v.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_k.weight | [320, 1 024] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0.bias | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_q.weight | [320, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_v.weight | [320, 1 024] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.ff.net.0.proj.bias | [2 560] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.ff.net.0.proj.weight | [2 560, 320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.ff.net.2.bias | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.ff.net.2.weight | [320, 1 280] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.norm1.bias | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.norm1.weight | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.norm2.bias | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.norm2.weight | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.norm3.bias | [320] | F16 |
| down_blocks.0.attentions.1.transformer_blocks.0.norm3.weight | [320] | F16 |
| down_blocks.0.downsamplers.0.conv.bias | [320] | F16 |
| down_blocks.0.downsamplers.0.conv.weight | [320, 320, 3, 3] | F16 |
| down_blocks.0.resnets.0.conv1.bias | [320] | F16 |
| down_blocks.0.resnets.0.conv1.weight | [320, 320, 3, 3] | F16 |
| down_blocks.0.resnets.0.conv2.bias | [320] | F16 |
| down_blocks.0.resnets.0.conv2.weight | [320, 320, 3, 3] | F16 |
| down_blocks.0.resnets.0.norm1.bias | [320] | F16 |
| down_blocks.0.resnets.0.norm1.weight | [320] | F16 |
| down_blocks.0.resnets.0.norm2.bias | [320] | F16 |
| down_blocks.0.resnets.0.norm2.weight | [320] | F16 |
| down_blocks.0.resnets.0.time_emb_proj.bias | [320] | F16 |
| down_blocks.0.resnets.0.time_emb_proj.weight | [320, 1 280] | F16 |
| down_blocks.0.resnets.1.conv1.bias | [320] | F16 |
| down_blocks.0.resnets.1.conv1.weight | [320, 320, 3, 3] | F16 |
| down_blocks.0.resnets.1.conv2.bias | [320] | F16 |
| down_blocks.0.resnets.1.conv2.weight | [320, 320, 3, 3] | F16 |
| down_blocks.0.resnets.1.norm1.bias | [320] | F16 |
| down_blocks.0.resnets.1.norm1.weight | [320] | F16 |
| down_blocks.0.resnets.1.norm2.bias | [320] | F16 |
| down_blocks.0.resnets.1.norm2.weight | [320] | F16 |
| down_blocks.0.resnets.1.time_emb_proj.bias | [320] | F16 |
| down_blocks.0.resnets.1.time_emb_proj.weight | [320, 1 280] | F16 |
| down_blocks.1.attentions.0.norm.bias | [640] | F16 |
| down_blocks.1.attentions.0.norm.weight | [640] | F16 |
| down_blocks.1.attentions.0.proj_in.bias | [640] | F16 |
| down_blocks.1.attentions.0.proj_in.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.proj_out.bias | [640] | F16 |
| down_blocks.1.attentions.0.proj_out.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight | [640, 1 024] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.weight | [640, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight | [640, 1 024] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [5 120] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [5 120, 640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.ff.net.2.bias | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.ff.net.2.weight | [640, 2 560] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.norm1.bias | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.norm1.weight | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.norm2.bias | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.norm2.weight | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.norm3.bias | [640] | F16 |
| down_blocks.1.attentions.0.transformer_blocks.0.norm3.weight | [640] | F16 |
| down_blocks.1.attentions.1.norm.bias | [640] | F16 |
| down_blocks.1.attentions.1.norm.weight | [640] | F16 |
| down_blocks.1.attentions.1.proj_in.bias | [640] | F16 |
| down_blocks.1.attentions.1.proj_in.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.proj_out.bias | [640] | F16 |
| down_blocks.1.attentions.1.proj_out.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.bias | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight | [640, 1 024] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.bias | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.weight | [640, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight | [640, 1 024] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj.bias | [5 120] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj.weight | [5 120, 640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.ff.net.2.bias | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.ff.net.2.weight | [640, 2 560] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.norm1.bias | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.norm1.weight | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.norm2.bias | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.norm2.weight | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.norm3.bias | [640] | F16 |
| down_blocks.1.attentions.1.transformer_blocks.0.norm3.weight | [640] | F16 |
| down_blocks.1.downsamplers.0.conv.bias | [640] | F16 |
| down_blocks.1.downsamplers.0.conv.weight | [640, 640, 3, 3] | F16 |
| down_blocks.1.resnets.0.conv1.bias | [640] | F16 |
| down_blocks.1.resnets.0.conv1.weight | [640, 320, 3, 3] | F16 |
| down_blocks.1.resnets.0.conv2.bias | [640] | F16 |
| down_blocks.1.resnets.0.conv2.weight | [640, 640, 3, 3] | F16 |
| down_blocks.1.resnets.0.conv_shortcut.bias | [640] | F16 |
| down_blocks.1.resnets.0.conv_shortcut.weight | [640, 320, 1, 1] | F16 |
| down_blocks.1.resnets.0.norm1.bias | [320] | F16 |
| down_blocks.1.resnets.0.norm1.weight | [320] | F16 |
| down_blocks.1.resnets.0.norm2.bias | [640] | F16 |
| down_blocks.1.resnets.0.norm2.weight | [640] | F16 |
| down_blocks.1.resnets.0.time_emb_proj.bias | [640] | F16 |
| down_blocks.1.resnets.0.time_emb_proj.weight | [640, 1 280] | F16 |
| down_blocks.1.resnets.1.conv1.bias | [640] | F16 |
| down_blocks.1.resnets.1.conv1.weight | [640, 640, 3, 3] | F16 |
| down_blocks.1.resnets.1.conv2.bias | [640] | F16 |
| down_blocks.1.resnets.1.conv2.weight | [640, 640, 3, 3] | F16 |
| down_blocks.1.resnets.1.norm1.bias | [640] | F16 |
| down_blocks.1.resnets.1.norm1.weight | [640] | F16 |
| down_blocks.1.resnets.1.norm2.bias | [640] | F16 |
| down_blocks.1.resnets.1.norm2.weight | [640] | F16 |
| down_blocks.1.resnets.1.time_emb_proj.bias | [640] | F16 |
| down_blocks.1.resnets.1.time_emb_proj.weight | [640, 1 280] | F16 |
| down_blocks.2.attentions.0.norm.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.norm.weight | [1 280] | F16 |
| down_blocks.2.attentions.0.proj_in.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.proj_in.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.proj_out.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.proj_out.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_k.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_q.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_v.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight | [1 280, 1 024] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight | [1 280, 1 024] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [10 240] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [10 240, 1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.ff.net.2.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.ff.net.2.weight | [1 280, 5 120] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.norm1.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.norm1.weight | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.norm2.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.norm2.weight | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.norm3.bias | [1 280] | F16 |
| down_blocks.2.attentions.0.transformer_blocks.0.norm3.weight | [1 280] | F16 |
| down_blocks.2.attentions.1.norm.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.norm.weight | [1 280] | F16 |
| down_blocks.2.attentions.1.proj_in.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.proj_in.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.proj_out.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.proj_out.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_k.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_q.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_v.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight | [1 280, 1 024] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q.weight | [1 280, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight | [1 280, 1 024] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj.bias | [10 240] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj.weight | [10 240, 1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.ff.net.2.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.ff.net.2.weight | [1 280, 5 120] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.norm1.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.norm1.weight | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.norm2.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.norm2.weight | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.norm3.bias | [1 280] | F16 |
| down_blocks.2.attentions.1.transformer_blocks.0.norm3.weight | [1 280] | F16 |
| down_blocks.2.downsamplers.0.conv.bias | [1 280] | F16 |
| down_blocks.2.downsamplers.0.conv.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.2.resnets.0.conv1.bias | [1 280] | F16 |
| down_blocks.2.resnets.0.conv1.weight | [1 280, 640, 3, 3] | F16 |
| down_blocks.2.resnets.0.conv2.bias | [1 280] | F16 |
| down_blocks.2.resnets.0.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.2.resnets.0.conv_shortcut.bias | [1 280] | F16 |
| down_blocks.2.resnets.0.conv_shortcut.weight | [1 280, 640, 1, 1] | F16 |
| down_blocks.2.resnets.0.norm1.bias | [640] | F16 |
| down_blocks.2.resnets.0.norm1.weight | [640] | F16 |
| down_blocks.2.resnets.0.norm2.bias | [1 280] | F16 |
| down_blocks.2.resnets.0.norm2.weight | [1 280] | F16 |
| down_blocks.2.resnets.0.time_emb_proj.bias | [1 280] | F16 |
| down_blocks.2.resnets.0.time_emb_proj.weight | [1 280, 1 280] | F16 |
| down_blocks.2.resnets.1.conv1.bias | [1 280] | F16 |
| down_blocks.2.resnets.1.conv1.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.2.resnets.1.conv2.bias | [1 280] | F16 |
| down_blocks.2.resnets.1.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.2.resnets.1.norm1.bias | [1 280] | F16 |
| down_blocks.2.resnets.1.norm1.weight | [1 280] | F16 |
| down_blocks.2.resnets.1.norm2.bias | [1 280] | F16 |
| down_blocks.2.resnets.1.norm2.weight | [1 280] | F16 |
| down_blocks.2.resnets.1.time_emb_proj.bias | [1 280] | F16 |
| down_blocks.2.resnets.1.time_emb_proj.weight | [1 280, 1 280] | F16 |
| down_blocks.3.resnets.0.conv1.bias | [1 280] | F16 |
| down_blocks.3.resnets.0.conv1.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.3.resnets.0.conv2.bias | [1 280] | F16 |
| down_blocks.3.resnets.0.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.3.resnets.0.norm1.bias | [1 280] | F16 |
| down_blocks.3.resnets.0.norm1.weight | [1 280] | F16 |
| down_blocks.3.resnets.0.norm2.bias | [1 280] | F16 |
| down_blocks.3.resnets.0.norm2.weight | [1 280] | F16 |
| down_blocks.3.resnets.0.time_emb_proj.bias | [1 280] | F16 |
| down_blocks.3.resnets.0.time_emb_proj.weight | [1 280, 1 280] | F16 |
| down_blocks.3.resnets.1.conv1.bias | [1 280] | F16 |
| down_blocks.3.resnets.1.conv1.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.3.resnets.1.conv2.bias | [1 280] | F16 |
| down_blocks.3.resnets.1.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| down_blocks.3.resnets.1.norm1.bias | [1 280] | F16 |
| down_blocks.3.resnets.1.norm1.weight | [1 280] | F16 |
| down_blocks.3.resnets.1.norm2.bias | [1 280] | F16 |
| down_blocks.3.resnets.1.norm2.weight | [1 280] | F16 |
| down_blocks.3.resnets.1.time_emb_proj.bias | [1 280] | F16 |
| down_blocks.3.resnets.1.time_emb_proj.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.norm.bias | [1 280] | F16 |
| mid_block.attentions.0.norm.weight | [1 280] | F16 |
| mid_block.attentions.0.proj_in.bias | [1 280] | F16 |
| mid_block.attentions.0.proj_in.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.proj_out.bias | [1 280] | F16 |
| mid_block.attentions.0.proj_out.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn1.to_k.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn1.to_q.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn1.to_v.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn2.to_k.weight | [1 280, 1 024] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn2.to_q.weight | [1 280, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.attn2.to_v.weight | [1 280, 1 024] | F16 |
| mid_block.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [10 240] | F16 |
| mid_block.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [10 240, 1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.ff.net.2.bias | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.ff.net.2.weight | [1 280, 5 120] | F16 |
| mid_block.attentions.0.transformer_blocks.0.norm1.bias | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.norm1.weight | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.norm2.bias | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.norm2.weight | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.norm3.bias | [1 280] | F16 |
| mid_block.attentions.0.transformer_blocks.0.norm3.weight | [1 280] | F16 |
| mid_block.resnets.0.conv1.bias | [1 280] | F16 |
| mid_block.resnets.0.conv1.weight | [1 280, 1 280, 3, 3] | F16 |
| mid_block.resnets.0.conv2.bias | [1 280] | F16 |
| mid_block.resnets.0.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| mid_block.resnets.0.norm1.bias | [1 280] | F16 |
| mid_block.resnets.0.norm1.weight | [1 280] | F16 |
| mid_block.resnets.0.norm2.bias | [1 280] | F16 |
| mid_block.resnets.0.norm2.weight | [1 280] | F16 |
| mid_block.resnets.0.time_emb_proj.bias | [1 280] | F16 |
| mid_block.resnets.0.time_emb_proj.weight | [1 280, 1 280] | F16 |
| mid_block.resnets.1.conv1.bias | [1 280] | F16 |
| mid_block.resnets.1.conv1.weight | [1 280, 1 280, 3, 3] | F16 |
| mid_block.resnets.1.conv2.bias | [1 280] | F16 |
| mid_block.resnets.1.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| mid_block.resnets.1.norm1.bias | [1 280] | F16 |
| mid_block.resnets.1.norm1.weight | [1 280] | F16 |
| mid_block.resnets.1.norm2.bias | [1 280] | F16 |
| mid_block.resnets.1.norm2.weight | [1 280] | F16 |
| mid_block.resnets.1.time_emb_proj.bias | [1 280] | F16 |
| mid_block.resnets.1.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.0.resnets.0.conv1.bias | [1 280] | F16 |
| up_blocks.0.resnets.0.conv1.weight | [1 280, 2 560, 3, 3] | F16 |
| up_blocks.0.resnets.0.conv2.bias | [1 280] | F16 |
| up_blocks.0.resnets.0.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.0.resnets.0.conv_shortcut.bias | [1 280] | F16 |
| up_blocks.0.resnets.0.conv_shortcut.weight | [1 280, 2 560, 1, 1] | F16 |
| up_blocks.0.resnets.0.norm1.bias | [2 560] | F16 |
| up_blocks.0.resnets.0.norm1.weight | [2 560] | F16 |
| up_blocks.0.resnets.0.norm2.bias | [1 280] | F16 |
| up_blocks.0.resnets.0.norm2.weight | [1 280] | F16 |
| up_blocks.0.resnets.0.time_emb_proj.bias | [1 280] | F16 |
| up_blocks.0.resnets.0.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.0.resnets.1.conv1.bias | [1 280] | F16 |
| up_blocks.0.resnets.1.conv1.weight | [1 280, 2 560, 3, 3] | F16 |
| up_blocks.0.resnets.1.conv2.bias | [1 280] | F16 |
| up_blocks.0.resnets.1.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.0.resnets.1.conv_shortcut.bias | [1 280] | F16 |
| up_blocks.0.resnets.1.conv_shortcut.weight | [1 280, 2 560, 1, 1] | F16 |
| up_blocks.0.resnets.1.norm1.bias | [2 560] | F16 |
| up_blocks.0.resnets.1.norm1.weight | [2 560] | F16 |
| up_blocks.0.resnets.1.norm2.bias | [1 280] | F16 |
| up_blocks.0.resnets.1.norm2.weight | [1 280] | F16 |
| up_blocks.0.resnets.1.time_emb_proj.bias | [1 280] | F16 |
| up_blocks.0.resnets.1.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.0.resnets.2.conv1.bias | [1 280] | F16 |
| up_blocks.0.resnets.2.conv1.weight | [1 280, 2 560, 3, 3] | F16 |
| up_blocks.0.resnets.2.conv2.bias | [1 280] | F16 |
| up_blocks.0.resnets.2.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.0.resnets.2.conv_shortcut.bias | [1 280] | F16 |
| up_blocks.0.resnets.2.conv_shortcut.weight | [1 280, 2 560, 1, 1] | F16 |
| up_blocks.0.resnets.2.norm1.bias | [2 560] | F16 |
| up_blocks.0.resnets.2.norm1.weight | [2 560] | F16 |
| up_blocks.0.resnets.2.norm2.bias | [1 280] | F16 |
| up_blocks.0.resnets.2.norm2.weight | [1 280] | F16 |
| up_blocks.0.resnets.2.time_emb_proj.bias | [1 280] | F16 |
| up_blocks.0.resnets.2.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.0.upsamplers.0.conv.bias | [1 280] | F16 |
| up_blocks.0.upsamplers.0.conv.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.1.attentions.0.norm.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.norm.weight | [1 280] | F16 |
| up_blocks.1.attentions.0.proj_in.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.proj_in.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.proj_out.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.proj_out.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight | [1 280, 1 024] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight | [1 280, 1 024] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [10 240] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [10 240, 1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.ff.net.2.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.ff.net.2.weight | [1 280, 5 120] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.norm1.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.norm1.weight | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.norm2.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.norm2.weight | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.norm3.bias | [1 280] | F16 |
| up_blocks.1.attentions.0.transformer_blocks.0.norm3.weight | [1 280] | F16 |
| up_blocks.1.attentions.1.norm.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.norm.weight | [1 280] | F16 |
| up_blocks.1.attentions.1.proj_in.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.proj_in.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.proj_out.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.proj_out.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight | [1 280, 1 024] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight | [1 280, 1 024] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj.bias | [10 240] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj.weight | [10 240, 1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.ff.net.2.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.ff.net.2.weight | [1 280, 5 120] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.norm1.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.norm1.weight | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.norm2.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.norm2.weight | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.norm3.bias | [1 280] | F16 |
| up_blocks.1.attentions.1.transformer_blocks.0.norm3.weight | [1 280] | F16 |
| up_blocks.1.attentions.2.norm.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.norm.weight | [1 280] | F16 |
| up_blocks.1.attentions.2.proj_in.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.proj_in.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.proj_out.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.proj_out.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_k.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_out.0.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_out.0.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_q.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_v.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_k.weight | [1 280, 1 024] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_q.weight | [1 280, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_v.weight | [1 280, 1 024] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.ff.net.0.proj.bias | [10 240] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.ff.net.0.proj.weight | [10 240, 1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.ff.net.2.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.ff.net.2.weight | [1 280, 5 120] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.norm1.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.norm1.weight | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.norm2.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.norm2.weight | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.norm3.bias | [1 280] | F16 |
| up_blocks.1.attentions.2.transformer_blocks.0.norm3.weight | [1 280] | F16 |
| up_blocks.1.resnets.0.conv1.bias | [1 280] | F16 |
| up_blocks.1.resnets.0.conv1.weight | [1 280, 2 560, 3, 3] | F16 |
| up_blocks.1.resnets.0.conv2.bias | [1 280] | F16 |
| up_blocks.1.resnets.0.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.1.resnets.0.conv_shortcut.bias | [1 280] | F16 |
| up_blocks.1.resnets.0.conv_shortcut.weight | [1 280, 2 560, 1, 1] | F16 |
| up_blocks.1.resnets.0.norm1.bias | [2 560] | F16 |
| up_blocks.1.resnets.0.norm1.weight | [2 560] | F16 |
| up_blocks.1.resnets.0.norm2.bias | [1 280] | F16 |
| up_blocks.1.resnets.0.norm2.weight | [1 280] | F16 |
| up_blocks.1.resnets.0.time_emb_proj.bias | [1 280] | F16 |
| up_blocks.1.resnets.0.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.1.resnets.1.conv1.bias | [1 280] | F16 |
| up_blocks.1.resnets.1.conv1.weight | [1 280, 2 560, 3, 3] | F16 |
| up_blocks.1.resnets.1.conv2.bias | [1 280] | F16 |
| up_blocks.1.resnets.1.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.1.resnets.1.conv_shortcut.bias | [1 280] | F16 |
| up_blocks.1.resnets.1.conv_shortcut.weight | [1 280, 2 560, 1, 1] | F16 |
| up_blocks.1.resnets.1.norm1.bias | [2 560] | F16 |
| up_blocks.1.resnets.1.norm1.weight | [2 560] | F16 |
| up_blocks.1.resnets.1.norm2.bias | [1 280] | F16 |
| up_blocks.1.resnets.1.norm2.weight | [1 280] | F16 |
| up_blocks.1.resnets.1.time_emb_proj.bias | [1 280] | F16 |
| up_blocks.1.resnets.1.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.1.resnets.2.conv1.bias | [1 280] | F16 |
| up_blocks.1.resnets.2.conv1.weight | [1 280, 1 920, 3, 3] | F16 |
| up_blocks.1.resnets.2.conv2.bias | [1 280] | F16 |
| up_blocks.1.resnets.2.conv2.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.1.resnets.2.conv_shortcut.bias | [1 280] | F16 |
| up_blocks.1.resnets.2.conv_shortcut.weight | [1 280, 1 920, 1, 1] | F16 |
| up_blocks.1.resnets.2.norm1.bias | [1 920] | F16 |
| up_blocks.1.resnets.2.norm1.weight | [1 920] | F16 |
| up_blocks.1.resnets.2.norm2.bias | [1 280] | F16 |
| up_blocks.1.resnets.2.norm2.weight | [1 280] | F16 |
| up_blocks.1.resnets.2.time_emb_proj.bias | [1 280] | F16 |
| up_blocks.1.resnets.2.time_emb_proj.weight | [1 280, 1 280] | F16 |
| up_blocks.1.upsamplers.0.conv.bias | [1 280] | F16 |
| up_blocks.1.upsamplers.0.conv.weight | [1 280, 1 280, 3, 3] | F16 |
| up_blocks.2.attentions.0.norm.bias | [640] | F16 |
| up_blocks.2.attentions.0.norm.weight | [640] | F16 |
| up_blocks.2.attentions.0.proj_in.bias | [640] | F16 |
| up_blocks.2.attentions.0.proj_in.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.proj_out.bias | [640] | F16 |
| up_blocks.2.attentions.0.proj_out.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_k.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_q.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_v.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight | [640, 1 024] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q.weight | [640, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight | [640, 1 024] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [5 120] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [5 120, 640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.ff.net.2.bias | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.ff.net.2.weight | [640, 2 560] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.norm1.bias | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.norm1.weight | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.norm2.bias | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.norm2.weight | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.norm3.bias | [640] | F16 |
| up_blocks.2.attentions.0.transformer_blocks.0.norm3.weight | [640] | F16 |
| up_blocks.2.attentions.1.norm.bias | [640] | F16 |
| up_blocks.2.attentions.1.norm.weight | [640] | F16 |
| up_blocks.2.attentions.1.proj_in.bias | [640] | F16 |
| up_blocks.2.attentions.1.proj_in.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.proj_out.bias | [640] | F16 |
| up_blocks.2.attentions.1.proj_out.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_k.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0.bias | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_q.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_v.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight | [640, 1 024] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.bias | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q.weight | [640, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight | [640, 1 024] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj.bias | [5 120] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj.weight | [5 120, 640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.ff.net.2.bias | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.ff.net.2.weight | [640, 2 560] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.norm1.bias | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.norm1.weight | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.norm2.bias | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.norm2.weight | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.norm3.bias | [640] | F16 |
| up_blocks.2.attentions.1.transformer_blocks.0.norm3.weight | [640] | F16 |
| up_blocks.2.attentions.2.norm.bias | [640] | F16 |
| up_blocks.2.attentions.2.norm.weight | [640] | F16 |
| up_blocks.2.attentions.2.proj_in.bias | [640] | F16 |
| up_blocks.2.attentions.2.proj_in.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.proj_out.bias | [640] | F16 |
| up_blocks.2.attentions.2.proj_out.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_k.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_out.0.bias | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_out.0.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_q.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_v.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_k.weight | [640, 1 024] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0.bias | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_q.weight | [640, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_v.weight | [640, 1 024] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.ff.net.0.proj.bias | [5 120] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.ff.net.0.proj.weight | [5 120, 640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.ff.net.2.bias | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.ff.net.2.weight | [640, 2 560] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.norm1.bias | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.norm1.weight | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.norm2.bias | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.norm2.weight | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.norm3.bias | [640] | F16 |
| up_blocks.2.attentions.2.transformer_blocks.0.norm3.weight | [640] | F16 |
| up_blocks.2.resnets.0.conv1.bias | [640] | F16 |
| up_blocks.2.resnets.0.conv1.weight | [640, 1 920, 3, 3] | F16 |
| up_blocks.2.resnets.0.conv2.bias | [640] | F16 |
| up_blocks.2.resnets.0.conv2.weight | [640, 640, 3, 3] | F16 |
| up_blocks.2.resnets.0.conv_shortcut.bias | [640] | F16 |
| up_blocks.2.resnets.0.conv_shortcut.weight | [640, 1 920, 1, 1] | F16 |
| up_blocks.2.resnets.0.norm1.bias | [1 920] | F16 |
| up_blocks.2.resnets.0.norm1.weight | [1 920] | F16 |
| up_blocks.2.resnets.0.norm2.bias | [640] | F16 |
| up_blocks.2.resnets.0.norm2.weight | [640] | F16 |
| up_blocks.2.resnets.0.time_emb_proj.bias | [640] | F16 |
| up_blocks.2.resnets.0.time_emb_proj.weight | [640, 1 280] | F16 |
| up_blocks.2.resnets.1.conv1.bias | [640] | F16 |
| up_blocks.2.resnets.1.conv1.weight | [640, 1 280, 3, 3] | F16 |
| up_blocks.2.resnets.1.conv2.bias | [640] | F16 |
| up_blocks.2.resnets.1.conv2.weight | [640, 640, 3, 3] | F16 |
| up_blocks.2.resnets.1.conv_shortcut.bias | [640] | F16 |
| up_blocks.2.resnets.1.conv_shortcut.weight | [640, 1 280, 1, 1] | F16 |
| up_blocks.2.resnets.1.norm1.bias | [1 280] | F16 |
| up_blocks.2.resnets.1.norm1.weight | [1 280] | F16 |
| up_blocks.2.resnets.1.norm2.bias | [640] | F16 |
| up_blocks.2.resnets.1.norm2.weight | [640] | F16 |
| up_blocks.2.resnets.1.time_emb_proj.bias | [640] | F16 |
| up_blocks.2.resnets.1.time_emb_proj.weight | [640, 1 280] | F16 |
| up_blocks.2.resnets.2.conv1.bias | [640] | F16 |
| up_blocks.2.resnets.2.conv1.weight | [640, 960, 3, 3] | F16 |
| up_blocks.2.resnets.2.conv2.bias | [640] | F16 |
| up_blocks.2.resnets.2.conv2.weight | [640, 640, 3, 3] | F16 |
| up_blocks.2.resnets.2.conv_shortcut.bias | [640] | F16 |
| up_blocks.2.resnets.2.conv_shortcut.weight | [640, 960, 1, 1] | F16 |
| up_blocks.2.resnets.2.norm1.bias | [960] | F16 |
| up_blocks.2.resnets.2.norm1.weight | [960] | F16 |
| up_blocks.2.resnets.2.norm2.bias | [640] | F16 |
| up_blocks.2.resnets.2.norm2.weight | [640] | F16 |
| up_blocks.2.resnets.2.time_emb_proj.bias | [640] | F16 |
| up_blocks.2.resnets.2.time_emb_proj.weight | [640, 1 280] | F16 |
| up_blocks.2.upsamplers.0.conv.bias | [640] | F16 |
| up_blocks.2.upsamplers.0.conv.weight | [640, 640, 3, 3] | F16 |
| up_blocks.3.attentions.0.norm.bias | [320] | F16 |
| up_blocks.3.attentions.0.norm.weight | [320] | F16 |
| up_blocks.3.attentions.0.proj_in.bias | [320] | F16 |
| up_blocks.3.attentions.0.proj_in.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.proj_out.bias | [320] | F16 |
| up_blocks.3.attentions.0.proj_out.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_k.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_out.0.bias | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_out.0.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_q.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_v.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k.weight | [320, 1 024] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0.bias | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_q.weight | [320, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_v.weight | [320, 1 024] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.ff.net.0.proj.bias | [2 560] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.ff.net.0.proj.weight | [2 560, 320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.ff.net.2.bias | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.ff.net.2.weight | [320, 1 280] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.norm1.bias | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.norm1.weight | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.norm2.bias | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.norm2.weight | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.norm3.bias | [320] | F16 |
| up_blocks.3.attentions.0.transformer_blocks.0.norm3.weight | [320] | F16 |
| up_blocks.3.attentions.1.norm.bias | [320] | F16 |
| up_blocks.3.attentions.1.norm.weight | [320] | F16 |
| up_blocks.3.attentions.1.proj_in.bias | [320] | F16 |
| up_blocks.3.attentions.1.proj_in.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.proj_out.bias | [320] | F16 |
| up_blocks.3.attentions.1.proj_out.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_k.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_out.0.bias | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_out.0.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_q.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_v.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_k.weight | [320, 1 024] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0.bias | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_q.weight | [320, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_v.weight | [320, 1 024] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.ff.net.0.proj.bias | [2 560] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.ff.net.0.proj.weight | [2 560, 320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.ff.net.2.bias | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.ff.net.2.weight | [320, 1 280] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.norm1.bias | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.norm1.weight | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.norm2.bias | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.norm2.weight | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.norm3.bias | [320] | F16 |
| up_blocks.3.attentions.1.transformer_blocks.0.norm3.weight | [320] | F16 |
| up_blocks.3.attentions.2.norm.bias | [320] | F16 |
| up_blocks.3.attentions.2.norm.weight | [320] | F16 |
| up_blocks.3.attentions.2.proj_in.bias | [320] | F16 |
| up_blocks.3.attentions.2.proj_in.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.proj_out.bias | [320] | F16 |
| up_blocks.3.attentions.2.proj_out.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight | [320, 1 024] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight | [320, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight | [320, 1 024] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias | [2 560] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight | [2 560, 320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight | [320, 1 280] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.norm1.bias | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.norm1.weight | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias | [320] | F16 |
| up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight | [320] | F16 |
| up_blocks.3.resnets.0.conv1.bias | [320] | F16 |
| up_blocks.3.resnets.0.conv1.weight | [320, 960, 3, 3] | F16 |
| up_blocks.3.resnets.0.conv2.bias | [320] | F16 |
| up_blocks.3.resnets.0.conv2.weight | [320, 320, 3, 3] | F16 |
| up_blocks.3.resnets.0.conv_shortcut.bias | [320] | F16 |
| up_blocks.3.resnets.0.conv_shortcut.weight | [320, 960, 1, 1] | F16 |
| up_blocks.3.resnets.0.norm1.bias | [960] | F16 |
| up_blocks.3.resnets.0.norm1.weight | [960] | F16 |
| up_blocks.3.resnets.0.norm2.bias | [320] | F16 |
| up_blocks.3.resnets.0.norm2.weight | [320] | F16 |
| up_blocks.3.resnets.0.time_emb_proj.bias | [320] | F16 |
| up_blocks.3.resnets.0.time_emb_proj.weight | [320, 1 280] | F16 |
| up_blocks.3.resnets.1.conv1.bias | [320] | F16 |
| up_blocks.3.resnets.1.conv1.weight | [320, 640, 3, 3] | F16 |
| up_blocks.3.resnets.1.conv2.bias | [320] | F16 |
| up_blocks.3.resnets.1.conv2.weight | [320, 320, 3, 3] | F16 |
| up_blocks.3.resnets.1.conv_shortcut.bias | [320] | F16 |
| up_blocks.3.resnets.1.conv_shortcut.weight | [320, 640, 1, 1] | F16 |
| up_blocks.3.resnets.1.norm1.bias | [640] | F16 |
| up_blocks.3.resnets.1.norm1.weight | [640] | F16 |
| up_blocks.3.resnets.1.norm2.bias | [320] | F16 |
| up_blocks.3.resnets.1.norm2.weight | [320] | F16 |
| up_blocks.3.resnets.1.time_emb_proj.bias | [320] | F16 |
| up_blocks.3.resnets.1.time_emb_proj.weight | [320, 1 280] | F16 |
| up_blocks.3.resnets.2.conv1.bias | [320] | F16 |
| up_blocks.3.resnets.2.conv1.weight | [320, 640, 3, 3] | F16 |
| up_blocks.3.resnets.2.conv2.bias | [320] | F16 |
| up_blocks.3.resnets.2.conv2.weight | [320, 320, 3, 3] | F16 |
| up_blocks.3.resnets.2.conv_shortcut.bias | [320] | F16 |
| up_blocks.3.resnets.2.conv_shortcut.weight | [320, 640, 1, 1] | F16 |
| up_blocks.3.resnets.2.norm1.bias | [640] | F16 |
| up_blocks.3.resnets.2.norm1.weight | [640] | F16 |
| up_blocks.3.resnets.2.norm2.bias | [320] | F16 |
| up_blocks.3.resnets.2.norm2.weight | [320] | F16 |
| up_blocks.3.resnets.2.time_emb_proj.bias | [320] | F16 |
| up_blocks.3.resnets.2.time_emb_proj.weight | [320, 1 280] | F16 |
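
For reference, a dump like the table above can be generated by iterating over the UNet's named parameters. This is only a hedged sketch (the tool actually used to produce the table is not stated); the model id, FP16 casting, and the markdown formatting are assumptions:

```python
# Hedged sketch: print a markdown table of parameter name / shape / precision
# for the SD 2.1 UNet, loaded in float16 to mirror the F16 dump above.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet", torch_dtype=torch.float16
)

print("| Tensors | Shape | Precision |")
print("|---|---|---|")
for name, param in unet.named_parameters():
    precision = "F16" if param.dtype == torch.float16 else str(param.dtype)
    print(f"| {name} | {list(param.shape)} | {precision} |")
```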
@inproceedings{zhangAddingConditionalControl2023,
  title     = {Adding Conditional Control to Text-to-Image Diffusion Models},
  booktitle = {2023 IEEE/CVF International Conference on Computer Vision (ICCV)},
  author    = {Zhang, Lvmin and Rao, Anyi and Agrawala, Maneesh},
  year      = {2023},
  month     = oct,
  pages     = {3813--3824},
  issn      = {2380-7504},
  doi       = {10.1109/ICCV51070.2023.00355},
}
