
Extremely Long Training Time #155

@HarryHuangYZ


Hi,

I’m training the SR3 model on my dataset, which consists of 2000 training samples and 100 validation samples. I’m using 4 A100 GPUs, but it’s taking over 3 hours to complete just 100 iterations (roughly 110 seconds per iteration at batch_size 8). Is this normal? Do you have any suggestions for improving the training speed?

Thanks

Here is my training config:
phase: train
gpu_ids: [0, 1, 2, 3]
path:[
  log: experiments/sr_ffhq_241113_184019/logs
  tb_logger: experiments/sr_ffhq_241113_184019/tb_logger
  results: experiments/sr_ffhq_241113_184019/results
  checkpoint: experiments/sr_ffhq_241113_184019/checkpoint
  resume_state: None
  experiments_root: experiments/sr_ffhq_241113_184019
]
datasets:[
  train:[
    name: FLAIR_SR_Train
    mode: LRHR
    dataroot: dataset/train_224_320
    datatype: img
    l_resolution: 224
    r_resolution: 320
    batch_size: 8
    num_workers: 8
    use_shuffle: True
    data_len: -1
  ]
  val:[
    name: FLAIR_SR_Val
    mode: LRHR
    dataroot: /dataset/val_224_320
    datatype: img
    l_resolution: 224
    r_resolution: 320
    data_len: 3
  ]
]
model:[
  which_model_G: sr3
  finetune_norm: False
  unet:[
    in_channel: 6
    out_channel: 3
    inner_channel: 64
    channel_multiplier: [1, 2, 4, 8, 8]
    attn_res: []
    res_blocks: 1
    dropout: 0.2
  ]
  beta_schedule:[
    train:[
      schedule: linear
      n_timestep: 2000
      linear_start: 1e-06
      linear_end: 0.01
    ]
    val:[
      schedule: linear
      n_timestep: 2000
      linear_start: 1e-06
      linear_end: 0.01
    ]
  ]
  diffusion:[
    image_size: 320
    channels: 3
    conditional: True
  ]
]
train:[
  n_iter: 1000000
  val_freq: 1000.0
  save_checkpoint_freq: 1000.0
  print_freq: 50
  optimizer:[
    type: adam
    lr: 3e-06
  ]
  ema_scheduler:[
    step_start_ema: 5000
    update_ema_every: 1
    ema_decay: 0.9999
  ]
]
wandb:[
  project: super_resolution_flair
]
distributed: True
log_wandb_ckpt: False
log_eval: False
enable_wandb: False
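
For reference, here is a small timing probe I could run to see whether the time is going into data loading or into the diffusion forward/backward pass. This is just a rough sketch on my side, assuming a standard PyTorch `DataLoader` and a `step_fn` callback that wraps one training step (feed data + forward/backward + optimizer update); `profile_loop` and `step_fn` are illustrative names, not part of this repo:

```python
# Rough diagnostic sketch (not from this repo): time the dataloader separately
# from the training step to see whether the bottleneck is I/O or GPU compute.
import time
import torch

def profile_loop(train_loader, step_fn, n_iters=20):
    data_t, step_t = 0.0, 0.0
    loader_iter = iter(train_loader)
    for _ in range(n_iters):
        t0 = time.time()
        batch = next(loader_iter)      # dataloader / disk / decode time
        t1 = time.time()
        step_fn(batch)                 # one training step: feed data, forward/backward, optimizer
        if torch.cuda.is_available():
            torch.cuda.synchronize()   # wait for queued GPU work before stopping the timer
        t2 = time.time()
        data_t += t1 - t0
        step_t += t2 - t1
    print(f"avg data-loading time per iter: {data_t / n_iters:.2f}s")
    print(f"avg training-step time per iter: {step_t / n_iters:.2f}s")
```

If most of the ~110 s per iteration turns out to be data loading, I would look at `num_workers` and pre-resizing the images on disk; if it is the training step itself, I would double-check that all four GPUs are actually being used.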
