
srun with OpenMPI 5.0.3 unexpectedly launches MPI jobs as singletons without an error #13523

Description

@GayatriManda

As an OpenMPI user, I noticed unexpected behavior when running MPI programs with Slurm’s srun.

  • Environment: module load OpenMPI/5.0.3
  • What happens:

Using mpirun (works as expected)

Hello from proc 0 of 4
Hello from proc 1 of 4
Hello from proc 2 of 4
Hello from proc 3 of 4

Using srun --mpi=pmi2

No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application, 4 singletons will be started.
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1

Using plain srun (without specifying --mpi)

Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
  • Why this is a problem: With --mpi=pmi2, OpenMPI at least prints a runtime message before falling back to singleton mode. With plain srun, the same fallback happens but no warning is shown at all. As a user this is very misleading: the job looks like a normal MPI run, but every process starts as a singleton (rank 0 of 1), so no communication happens and resources are wasted.

  • What I would expect: It would be more helpful if OpenMPI issued an error or warning whenever it cannot connect to PMI/PMIx under srun, rather than silently launching singletons, and showed the same warning even when --mpi= is not explicitly specified.

This would prevent us from unintentionally running incorrect MPI jobs.
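
For reference, the behavior above can be reproduced with a minimal MPI hello-world along the following lines. This is a sketch reconstructed from the output shown; the original program and exact launch commands are not included in the report, so the build/launch lines in the comment and the SLURM_NTASKS-based guard are assumptions, not part of the reported setup. The guard illustrates a user-side workaround: if Slurm allocated several tasks but MPI only sees one rank, the process was almost certainly started as a singleton.

/*
 * Minimal reproducer (reconstructed sketch; the original source is not shown).
 *
 * Assumed build/launch commands, e.g.:
 *   module load OpenMPI/5.0.3
 *   mpicc hello_mpi.c -o hello_mpi
 *   mpirun -np 4 ./hello_mpi           # prints ranks 0..3 of 4
 *   srun -n 4 --mpi=pmi2 ./hello_mpi   # each task falls back to a singleton
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Defensive check (workaround until OpenMPI warns by default):
     * SLURM_NTASKS is set by Slurm in each task's environment. If it says
     * more than one task but MPI_COMM_WORLD has size 1, this process is
     * running as a singleton, so fail loudly instead of "succeeding". */
    const char *ntasks_env = getenv("SLURM_NTASKS");
    if (ntasks_env != NULL && atoi(ntasks_env) > 1 && size == 1) {
        fprintf(stderr,
                "WARNING: SLURM_NTASKS=%s but MPI_COMM_WORLD size is 1; "
                "this process was started as a singleton.\n", ntasks_env);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    printf("Hello from proc %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

As a quick diagnostic on the Slurm side, srun --mpi=list shows which PMI plugins the installed srun supports; since OpenMPI 5.x relies on PMIx rather than PMI-1/2, that list indicates whether --mpi=pmix is even available on the cluster.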
