Description
As an OpenMPI user, I noticed unexpected behavior when running MPI programs with Slurm’s srun.
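The test program itself is an ordinary MPI hello world; the exact source is not reproduced here, but a minimal sketch of the kind of program that produces the output below is:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

    printf("Hello from proc %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

It is built with the mpicc wrapper from the loaded module (e.g. mpicc hello.c -o hello) and launched with 4 tasks (-n 4).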
- Environment:
  module load OpenMPI/5.0.3
- What happens:
Using mpirun (works as expected)
Hello from proc 0 of 4
Hello from proc 1 of 4
Hello from proc 2 of 4
Hello from proc 3 of 4
Using srun --mpi=pmi2
No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application, 4 singletons will be started.
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Using plain srun (without explicitly specifying --mpi)
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
- Why this is a problem: With --mpi=pmi2, OpenMPI at least prints a runtime notice before falling back to singleton mode. With plain srun, the same fallback happens but no warning is shown at all. As a user, this is very misleading: the job looks like a normal MPI run, but every process starts as a singleton (rank 0 of 1), so no communication happens and resources are wasted (a user-side guard against this is sketched at the end of this report).
- What I would expect: It would be more helpful if OpenMPI issued an error or warning whenever it cannot connect to PMI/PMIx under srun, rather than silently launching singletons, and showed that warning even when --mpi= is not explicitly specified.
This would prevent us from unintentionally running incorrect MPI jobs.
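Until such a warning exists, the only protection I see is a guard inside the application. The sketch below is my own workaround, not something OpenMPI provides; it assumes Slurm exports SLURM_NTASKS to the job step and aborts when the MPI world size does not match that task count:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Workaround sketch: compare the MPI world size with the task count
     * Slurm advertises via SLURM_NTASKS and bail out on a mismatch, so a
     * silent singleton fallback cannot masquerade as a real MPI run. */
    const char *ntasks_env = getenv("SLURM_NTASKS");
    if (ntasks_env != NULL) {
        int expected = atoi(ntasks_env);
        if (expected > 1 && size != expected) {
            fprintf(stderr,
                    "Expected %d ranks (SLURM_NTASKS) but MPI_COMM_WORLD has "
                    "%d: the PMI/PMIx wire-up likely failed.\n",
                    expected, size);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }

    printf("Hello from proc %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Under mpirun or a working PMIx connection the check is a no-op; in the silent fallback shown above, every singleton aborts immediately instead of pretending to be rank 0 of 1.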