
Commit 4a37e35

Fixes _compute_nproc_per_node in case of bad dist configuration (#2288)
Description:

- Now uses a new gloo group to compute nproc per node.
- Context: with NCCL, if the user misconfigures the CUDA device per process, idist hangs in _compute_nproc_per_node.
- Here is an example: https://app.circleci.com/pipelines/github/pytorch/ignite/2264/workflows/2e3073fd-0859-41c7-91e8-eef0f8eabee2/jobs/7060?invite=true#step-107-872
- However, I couldn't reproduce the issue on my setup.

A standalone sketch of this pattern is included after the diffs below.

cc @sdesrozier
1 parent 74dabca commit 4a37e35

File tree

2 files changed: +5 -8

.circleci/config.yml

Lines changed: 0 additions & 2 deletions
@@ -86,7 +86,6 @@ run_pytorch_container: &run_pytorch_container
 docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image >>
 docker exec -it pthd nvidia-smi
 docker exec -it pthd ls
-docker exec -it pthd /bin/bash -c "$update_pth_cmd"

 run_pytorch_devel_container: &run_pytorch_devel_container
 - run:
@@ -97,7 +96,6 @@ run_pytorch_devel_container: &run_pytorch_devel_container
 docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image_devel >>
 docker exec -it pthd nvidia-smi
 docker exec -it pthd ls
-docker exec -it pthd /bin/bash -c "$update_pth_cmd"

 install_dependencies: &install_dependencies
 - run:

ignite/distributed/comp_models/native.py

Lines changed: 5 additions & 6 deletions
@@ -150,12 +150,11 @@ def _init_from_context(self) -> None:

     def _compute_nproc_per_node(self) -> int:
         local_rank = self.get_local_rank()
-        device = torch.device("cpu")
-        if torch.cuda.is_available():
-            # we manually set cuda device to local rank in order to avoid a hang on all_reduce
-            device = torch.device(f"cuda:{local_rank}")
-        tensor = torch.tensor([self.get_local_rank() + 1]).to(device)
-        dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
+        # Create a new cpu group to get nproc_per_node so that we avoid using
+        # a badly configured NCCL backend
+        gloo_group = dist.new_group(backend="gloo")
+        tensor = torch.tensor([local_rank + 1]).to("cpu")
+        dist.all_reduce(tensor, op=dist.ReduceOp.MAX, group=gloo_group)
         return int(tensor.item())

     def _get_all_hostnames(self) -> List[Tuple[str, ...]]:
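For context, here is a minimal standalone sketch of the pattern this commit adopts; it is illustrative code, not part of ignite. The idea: the default process group may be NCCL, and a wrongly assigned per-process CUDA device can hang a CUDA all_reduce, so nproc per node is computed over a separate gloo group with a CPU tensor instead. The script name, the compute_nproc_per_node helper, and the torchrun-style launch (LOCAL_RANK environment variable) are assumptions for illustration only.

import os

import torch
import torch.distributed as dist


def compute_nproc_per_node() -> int:
    # Hypothetical helper mirroring _compute_nproc_per_node above: run the MAX
    # all_reduce over a CPU/gloo subgroup so that a badly configured NCCL
    # default group cannot make the collective hang.
    local_rank = int(os.environ["LOCAL_RANK"])
    gloo_group = dist.new_group(backend="gloo")
    tensor = torch.tensor([local_rank + 1])  # stays on CPU
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX, group=gloo_group)
    return int(tensor.item())


if __name__ == "__main__":
    # e.g. launched with: torchrun --nproc_per_node=2 check_nproc.py
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    print(compute_nproc_per_node())
    dist.destroy_process_group()

Since gloo collectives run on host memory, the result does not depend on how CUDA devices are mapped to processes; the only overhead is creating one extra, small process group.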
