Model Parallel assertion error #17


I spun up a g4dn.12xlarge instance with 4 T4s and ran the command below, but it ended with an AssertionError :/

`[ec2-user@ip ~]$ docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e MODEL_PARALLEL=1 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:multi-gpu
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [0, 1, 2, 3]
Free GPU memory (per device):  [8665, 8665, 8665, 8665]
Search has found that 17 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}]
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.34k/1.34k [00:00<00:00, 739kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.5k/12.5k [00:00<00:00, 12.9MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 342/342 [00:00<00:00, 182kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 543/543 [00:00<00:00, 307kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.63k/4.63k [00:00<00:00, 2.48MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 608M/608M [00:07<00:00, 77.8MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 117kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 122kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 572/572 [00:00<00:00, 317kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 246M/246M [00:03<00:00, 72.5MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 58.8MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 563kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 788/788 [00:00<00:00, 1.07MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 62.3MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 772/772 [00:00<00:00, 1.07MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.72G/1.72G [00:22<00:00, 75.2MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71.2k/71.2k [00:00<00:00, 37.7MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 550/550 [00:00<00:00, 300kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 167M/167M [00:02<00:00, 74.1MB/s]
Traceback (most recent call last):
  File "server.py", line 9, in <module>
    from main import inference, MP as model_parallel
  File "/app/main.py", line 55, in <module>
    n_procs, devices, model_parallel_assignment=model_ass, **kwargs
  File "/app/parallel.py", line 149, in from_pretrained
    assert d
AssertionError
`

It's loading a lot of models, 17 in fact. Might that be the culprit?
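
One thing that stands out in the log is that all 17 assignments are the same dict, `{0: 0, 1: 0, 2: 0, 3: 0}`. Here is a minimal sketch to tally that. It assumes each dict maps a model component index to a device index, which is only my reading of the log output and is not confirmed against `parallel.py`; the helper name `devices_used` is made up for illustration.

```python
from collections import Counter

# Assignment list as printed in the log: 17 identical dicts.
# Assumption (not confirmed): keys are component indices, values are device indices.
assignments = [{0: 0, 1: 0, 2: 0, 3: 0} for _ in range(17)]

def devices_used(assignment_list):
    """Count how many component placements land on each device index."""
    return Counter(dev for a in assignment_list for dev in a.values())

usage = devices_used(assignments)
print(usage)  # Counter({0: 68})

# Devices requested via DEVICES=0,1,2,3 that never appear as a target.
requested = [0, 1, 2, 3]
unused = [d for d in requested if d not in usage]
print(unused)  # [1, 2, 3]
```

If that reading is right, every placement targets device index 0 and indices 1 to 3 never show up as a target, which might be related to the failing assertion, but I can't tell from the traceback alone.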

Anyways, if I can participate in testing or help in any way, I'm here to do so :)
I'm also wondering why it reports only 8665MB of free memory per device when nvidia-smi told me I had 15360MiB free per GPU just before that.
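
To compare the two numbers side by side, something like the sketch below could be run right before starting the container. It uses the `nvidia-smi --query-gpu` CSV output; the function names and the sample string are my own illustration (the sample reuses the 15360 figure I saw and assumes total equals free), and 8665 is the value printed in the log above.

```python
import subprocess

# nvidia-smi query returning "free, total" per GPU, in MiB, without header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits"]

def parse_memory(output):
    """Parse lines like '15360, 15360' into a list of (free, total) int tuples."""
    rows = []
    for line in output.strip().splitlines():
        free, total = (int(x.strip()) for x in line.split(","))
        rows.append((free, total))
    return rows

def query_gpus():
    """Run nvidia-smi and parse its output; needs the NVIDIA driver installed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)

# Illustrative sample only (not captured output): 4 GPUs, 15360 MiB free each.
sample = "15360, 15360\n15360, 15360\n15360, 15360\n15360, 15360\n"
reported_by_tool = 8665  # "Free GPU memory (per device)" from the log

for idx, (free, total) in enumerate(parse_memory(sample)):
    gap = free - reported_by_tool
    print(f"GPU {idx}: nvidia-smi free={free} MiB, tool reported={reported_by_tool}, gap={gap}")
```

On a real machine, `query_gpus()` would replace the sample string. This only shows the size of the gap (6695 with the figures above); it doesn't explain where the difference comes from.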

_Originally posted by @huotarih in https://github.com/NickLucche/stable-diffusion-nvidia-docker/issues/8#issuecomment-1264480363_
      
