Running on Apple M1/M2/M3 chips #105

@alisonpeard

Hi,
I'm trying to run the PyTorch training implementation on an Apple M2 chip with MPS. I can run StyleGAN2-ADA image generation by following these steps, but when I try to train DiffAugment I get this error:

/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolutionA14.mm:3237: failed assertion `destination datatype must be fp32'

My steps so far:

  1. Clone the repo and cd data-efficient-gans/DiffAugment-stylegan2-pytorch
  2. conda create -n DiffAug python=3.9
  3. conda activate DiffAug
  4. conda install pytorch torchvision torchaudio -c pytorch
  5. pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3
  6. pip install Pillow psutil scipy
  7. Following the advice here,
    • I replace all instances of torch.device('cuda') with torch.device('mps')
    • I replace random array generation with random_array = np.random.RandomState(seed).randn(1, G.z_dim).astype(np.float32) in generate.py as described.
    • In training_loop.py I replace instances of torch.cuda.Event(enable_timing=True) with time.perf_counter()
    • I remove torch.backends.cuda.matmul.allow_tf32 = allow_tf32, torch.backends.cudnn.allow_tf32 = allow_tf32, torch.cuda.reset_peak_memory_stats(), and all_gen_c = torch.from_numpy(np.stack(all_gen_c)).pin_memory().to(device) as they have no MPS equivalent.
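For reference, the timing replacement in training_loop.py can be sketched as a small stand-in class that keeps the record() / elapsed_time() call pattern of torch.cuda.Event. This is only a sketch: PerfCounterEvent is a hypothetical helper name, not part of PyTorch or this repo, and it measures host wall-clock time only. MPS work is queued asynchronously, so the numbers are probably only meaningful if the device is synchronized before each record() (torch.mps.synchronize() in recent PyTorch releases, as far as I know).

```python
import time


class PerfCounterEvent:
    """Wall-clock stand-in for torch.cuda.Event(enable_timing=True).

    Hypothetical helper for illustration; not a PyTorch API.
    """

    def __init__(self):
        self._t = None  # seconds from time.perf_counter(), set by record()

    def record(self, stream=None):
        # `stream` is accepted and ignored so CUDA-style calls such as
        # event.record(some_stream) keep working unchanged.
        self._t = time.perf_counter()

    def elapsed_time(self, end_event):
        # Same convention as torch.cuda.Event.elapsed_time:
        # milliseconds elapsed from this event to end_event.
        if self._t is None or end_event._t is None:
            raise RuntimeError("record() must be called on both events first")
        return (end_event._t - self._t) * 1000.0


start_event = PerfCounterEvent()
end_event = PerfCounterEvent()

start_event.record()
time.sleep(0.01)  # placeholder for the work being timed
end_event.record()

elapsed_ms = start_event.elapsed_time(end_event)
print(f"elapsed: {elapsed_ms:.1f} ms")
```

Keeping the same method names means the rest of the loop that reads the timings should not need to change, only the two lines that construct the events.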

At this point I can generate images from the pretrained model, e.g.,

python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

but training aborts with the message below:

python train.py --outdir=../training-runs --data=../datasets/100-shot-obama.zip --gpus=1 --kimg 1
# /AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolutionA14.mm:3237: failed assertion `destination datatype must be fp32'

Using pdb, I can trace the error from .../DiffAugment-stylegan2-pytorch/training/loss.py(80)accumulate_gradients() -> loss_Gmain.mean().mul(gain).backward() to torch/autograd/graph.py(769)_engine_run_backward() -> return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass, but I can't figure out what is going on beyond that point. I have checked that all tensors in the training loop are float32.
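The kind of dtype check described above can be sketched roughly as follows. This is a minimal, duck-typed sketch rather than the exact code I ran: find_unexpected_dtypes and the layer names are made up for illustration, and the stand-in objects only exist so the snippet runs without PyTorch. Note that a check like this covers parameters and buffers only; intermediate activations created inside forward(), and the gradients produced during backward(), would need a separate check.

```python
from types import SimpleNamespace


def find_unexpected_dtypes(named_tensors, expected_dtype):
    """Return (name, dtype) pairs whose dtype differs from expected_dtype.

    Duck-typed: each item only needs a `.dtype` attribute, so this should
    accept the output of module.named_parameters() or named_buffers().
    """
    return [
        (name, t.dtype)
        for name, t in named_tensors
        if t.dtype != expected_dtype
    ]


# Stand-in objects with made-up names so the sketch runs without PyTorch.
# With PyTorch the call would look something like:
#   find_unexpected_dtypes(G.named_parameters(), torch.float32)
fake_params = [
    ("layer_a.weight", SimpleNamespace(dtype="float32")),
    ("layer_b.weight", SimpleNamespace(dtype="float16")),
]

offenders = find_unexpected_dtypes(fake_params, "float32")
print(offenders)
```

An empty list means every inspected tensor already has the expected dtype, which is what I see for the tensors I looked at.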

Any suggestions would be appreciated! I don't have access to NVIDIA GPUs at the moment and the Colab also seems to be outdated.
