
Extremely Long Training Time #155

@HarryHuangYZ


Hi,

I’m training the SR3 model on my dataset, which consists of 2000 training samples and 100 validation samples. I’m using 4 A100 GPUs, but it’s taking over 3 hours to complete just 100 iterations (roughly 110 seconds per iteration at batch_size 8). Is this normal? Do you have any suggestions for improving the training speed?

Thanks

Here is my training config:
phase: train
gpu_ids: [0, 1, 2, 3]
path:[
  log: experiments/sr_ffhq_241113_184019/logs
  tb_logger: experiments/sr_ffhq_241113_184019/tb_logger
  results: experiments/sr_ffhq_241113_184019/results
  checkpoint: experiments/sr_ffhq_241113_184019/checkpoint
  resume_state: None
  experiments_root: experiments/sr_ffhq_241113_184019
]
datasets:[
  train:[
    name: FLAIR_SR_Train
    mode: LRHR
    dataroot: dataset/train_224_320
    datatype: img
    l_resolution: 224
    r_resolution: 320
    batch_size: 8
    num_workers: 8
    use_shuffle: True
    data_len: -1
  ]
  val:[
    name: FLAIR_SR_Val
    mode: LRHR
    dataroot: /dataset/val_224_320
    datatype: img
    l_resolution: 224
    r_resolution: 320
    data_len: 3
  ]
]
model:[
  which_model_G: sr3
  finetune_norm: False
  unet:[
    in_channel: 6
    out_channel: 3
    inner_channel: 64
    channel_multiplier: [1, 2, 4, 8, 8]
    attn_res: []
    res_blocks: 1
    dropout: 0.2
  ]
  beta_schedule:[
    train:[
      schedule: linear
      n_timestep: 2000
      linear_start: 1e-06
      linear_end: 0.01
    ]
    val:[
      schedule: linear
      n_timestep: 2000
      linear_start: 1e-06
      linear_end: 0.01
    ]
  ]
  diffusion:[
    image_size: 320
    channels: 3
    conditional: True
  ]
]
train:[
  n_iter: 1000000
  val_freq: 1000.0
  save_checkpoint_freq: 1000.0
  print_freq: 50
  optimizer:[
    type: adam
    lr: 3e-06
  ]
  ema_scheduler:[
    step_start_ema: 5000
    update_ema_every: 1
    ema_decay: 0.9999
  ]
]
wandb:[
  project: super_resolution_flair
]
distributed: True
log_wandb_ckpt: False
log_eval: False
enable_wandb: False
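
For reference, here is a small timing probe I could run to see whether the time is going into data loading or into the diffusion forward/backward pass. This is just a rough sketch on my side, assuming a standard PyTorch `DataLoader` and a `step_fn` callback that wraps one training step (feed data + forward/backward + optimizer update); `profile_loop` and `step_fn` are illustrative names, not part of this repo:

```python
# Rough diagnostic sketch (not from this repo): time the dataloader separately
# from the training step to see whether the bottleneck is I/O or GPU compute.
import time
import torch

def profile_loop(train_loader, step_fn, n_iters=20):
    data_t, step_t = 0.0, 0.0
    loader_iter = iter(train_loader)
    for _ in range(n_iters):
        t0 = time.time()
        batch = next(loader_iter)      # dataloader / disk / decode time
        t1 = time.time()
        step_fn(batch)                 # one training step: feed data, forward/backward, optimizer
        if torch.cuda.is_available():
            torch.cuda.synchronize()   # wait for queued GPU work before stopping the timer
        t2 = time.time()
        data_t += t1 - t0
        step_t += t2 - t1
    print(f"avg data-loading time per iter: {data_t / n_iters:.2f}s")
    print(f"avg training-step time per iter: {step_t / n_iters:.2f}s")
```

If most of the ~110 s per iteration turns out to be data loading, I would look at `num_workers` and pre-resizing the images on disk; if it is the training step itself, I would double-check that all four GPUs are actually being used.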
