Conversation

@Matthew-Whitlock
Contributor

It is possible for the ob1 pml to receive a message on MPI_COMM_WORLD while waiting on a PMIx fence here:

OMPI_LAZY_WAIT_FOR_COMPLETION(active);

This leads to a segfault, since MPI_COMM_WORLD is not set up in the pml until after the fence:

MCA_PML_CALL(add_comm(&ompi_mpi_comm_world.comm));

This change adds the early messages to the non_existing_communicator_pending list, which is then checked when MPI_COMM_WORLD is added to the pml.

Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
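
For illustration, here is a minimal, self-contained sketch of the pending-list pattern described above: fragments that arrive before a communicator has PML state are stashed, then replayed when the communicator is added. The names (pending_frag_t, stash_early_fragment, replay_pending_for_comm) are hypothetical stand-ins, not the actual ob1 code or this patch; in the real change, as described above, ob1's existing non_existing_communicator_pending list plays the role of pending_list.

#include <stdlib.h>

/* Hypothetical model of an "early fragments" list. */
typedef struct pending_frag {
    struct pending_frag *next;
    int comm_index;   /* identifier of the communicator the fragment arrived on */
    void *payload;    /* opaque saved copy of the fragment data */
} pending_frag_t;

static pending_frag_t *pending_list = NULL;

/* Receive path: the target communicator has no PML state yet (e.g.
 * MPI_COMM_WORLD before add_comm has run), so stash the fragment. */
static void stash_early_fragment(int comm_index, void *payload)
{
    pending_frag_t *f = malloc(sizeof(*f));
    if (NULL == f) {
        return;
    }
    f->comm_index = comm_index;
    f->payload    = payload;
    f->next       = pending_list;
    pending_list  = f;
}

/* add_comm path: the communicator's PML state now exists, so replay any
 * fragments that arrived before it was added. */
static void replay_pending_for_comm(int comm_index, void (*deliver)(void *payload))
{
    pending_frag_t **p = &pending_list;
    while (NULL != *p) {
        if ((*p)->comm_index == comm_index) {
            pending_frag_t *f = *p;
            *p = f->next;
            deliver(f->payload);
            free(f);
        } else {
            p = &(*p)->next;
        }
    }
}
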
@hppritcha hppritcha self-requested a review December 9, 2025 17:17
@bosilca
Member

bosilca commented Dec 9, 2025

That fence is the modex fence, which means that before it there is no knowledge about the peers, communicators, or anything else related to data movement. In other words, no processes have been added to the PML/BML/BTL, so there are no endpoints, so how is the data received? I don't think this is right.

@hppritcha
Member

I could see that under some conditions one or more processes may exit the non-blocking fence sufficiently far ahead of the others that they could send a message before the intended receiver had added the self/world communicators.
@Matthew-Whitlock under what conditions are you observing this?

@bosilca
Member

bosilca commented Dec 9, 2025

Assuming that is true, how come we haven't seen such a critical issue in the last 10 years at any scale?

@bosilca
Member

bosilca left a comment


We need to hold off on this until we understand what exactly the issue is here.

@rhc54
Contributor

rhc54 commented Dec 9, 2025

I always thought that was what the final fence at the end of mpi_init was for: to ensure that everyone completed add_procs before we enabled communication?

@hppritcha
Member

If mpi_async_mpi_init is true then there isn't a second pmix_fence_nb executed. That's part of the reason I asked @Matthew-Whitlock to give more info here about when he's hitting this problem.

@Matthew-Whitlock
Contributor Author

I'm running against a version of PRRTE modified for fault tolerance, which could change some timing assumptions. I'd imagine (hope) that it doesn't add any significant timing delays, but that could be a possible source of an issue. I haven't tested with a totally clean install, but I'll do that now since y'all are saying this sounds unusual. Here's more of my current debug info, though.

My MPI parameters: -n 3300 --map-by ppr:100:node --output merge-stderr,file=.prints --mca btl tcp,sm,self --with-ft=ulfm --enable-recovery --prtemca prte_keepalive_time 30 --prtemca prte_keepalive_probes 4 --prtemca prte_keepalive_intvl 1 --mca plm slurm --mca coll ^han --mca smsc ^xpmem

And here's the GDB backtrace that pointed me towards this as the issue:

Thread 1 "lmp" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d14458 in mca_pml_ob1_peer_lookup (comm=0xd68d20 <ompi_mpi_comm_world>, rank=3072) at <...>/ompi/mca/pml/ob1/pml_ob1_comm.h:99
99	    if( OPAL_UNLIKELY(rank >= (int)pml_comm->num_procs) ) {

#0  0x00007ffff7d14458 in mca_pml_ob1_peer_lookup (comm=0xd68d20 <ompi_mpi_comm_world>, rank=3072) at <...>/ompi/mca/pml/ob1/pml_ob1_comm.h:99
99	    if( OPAL_UNLIKELY(rank >= (int)pml_comm->num_procs) ) {
        pml_comm = 0x0
#1  0x00007ffff7d1683a in mca_pml_ob1_recv_frag_callback_match (btl=0xf993d0, descriptor=0x7fffffff2600) at <...>/ompi/mca/pml/ob1/pml_ob1_recvfrag.c:489
489	    proc = mca_pml_ob1_peer_lookup (comm_ptr, hdr->hdr_src);
        segments = 0x7ffff015eb88
        hdr = 0x7ffff015ec50
        comm_ptr = 0xd68d20 <ompi_mpi_comm_world>
        match = 0x0
        comm = 0x0
        proc = 0x0
        num_segments = 1
        bytes_received = 0
        __PRETTY_FUNCTION__ = "mca_pml_ob1_rec"...
#2  0x00007ffff72b4a89 in mca_btl_tcp_endpoint_recv_handler (sd=15, flags=2, user=0x100aa10) at <...>/opal/mca/btl/tcp/btl_tcp_endpoint.c:1044
1044	                reg->cbfunc(&frag->btl->super, &desc);
        reg = 0x7ffff7358950 <mca_btl_base_active_message_trigger+1040>
        desc = {
          endpoint = 0x100aa10,
          des_segments = 0x7ffff015eb88,
          des_segment_count = 1,
          tag = 65 'A',
          cbdata = 0x0
        }
        frag = 0x7ffff015eb00
        btl_endpoint = 0x100aa10
        __PRETTY_FUNCTION__ = "mca_btl_tcp_end"...
        __func__ = "mca_btl_tcp_end"...
#3  0x00007ffff6a27485 in event_persist_closure (ev=<optimized out>, base=0xdc97e0) at event.c:1623
1623	        (evcb_callback)(evcb_fd, evcb_res, evcb_arg);
        evcb_callback = <optimized out>
        evcb_fd = <optimized out>
        evcb_res = 2
        evcb_arg = 0x100aa10
        evcb_callback = <optimized out>
        evcb_fd = <optimized out>
        evcb_res = <optimized out>
        evcb_arg = <optimized out>
        __func__ = "event_persist_c"...
        run_at = <optimized out>
        relative_to = <optimized out>
        delay = <optimized out>
        now = <optimized out>
        usec_mask = <optimized out>
#4  event_process_active_single_queue (base=base@entry=0xdc97e0, activeq=0xdc9df0, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0) at event.c:1682
1682				event_persist_closure(base, ev);
        ev = <optimized out>
        evcb = <optimized out>
        count = 1
        __func__ = "event_process_a"...
#5  0x00007ffff6a27b0f in event_process_active (base=0xdc97e0) at event.c:1783
1783					c = event_process_active_single_queue(base, activeq,
        activeq = <optimized out>
        i = 4
        c = 0
        tv = {
          tv_sec = 16061063,
          tv_usec = 140737340837896
        }
        maxcb = 2147483647
        endtime = 0x0
        limit_after_prio = 2147483647
        activeq = <optimized out>
        i = <optimized out>
        c = <optimized out>
        endtime = <optimized out>
        tv = <optimized out>
        maxcb = <optimized out>
        limit_after_prio = <optimized out>
        done = <optimized out>
#6  event_base_loop (base=0xdc97e0, flags=2) at event.c:2006
2006				int n = event_process_active(base);
        n = <optimized out>
        evsel = 0x7ffff6a3eb80 <epollops>
        tv = {
          tv_sec = 0,
          tv_usec = 0
        }
        tv_p = <optimized out>
        res = <optimized out>
        done = 0
        retval = 0
        __func__ = "event_base_loop"
#7  0x00007ffff7210d40 in opal_progress_events () at <...>/opal/runtime/opal_progress.c:188
188	            events += opal_event_loop(opal_sync_event_base, opal_progress_event_flag);
        now = 30517406652018573
        lock = 1
        events = 0
#8  0x00007ffff7210df9 in opal_progress () at <...>/opal/runtime/opal_progress.c:240
240	        opal_progress_events();
        num_calls = 339
        i = 2
        events = 0
#9  0x00007ffff79e6e50 in ompi_mpi_init (argc=34, argv=0x7fffffff2e38, requested=0, provided=0x7fffffff2c98, reinit_ok=false) at <...>/ompi/runtime/ompi_mpi_init.c:490
490	            OMPI_LAZY_WAIT_FOR_COMPLETION(active);
        ret = 0
        error = 0x0
        evar = 0x0
        active = false
        background_fence = false
        info = {{
            key = "pmix.collect", '\000' <repeats 499 times>,
            flags = 0,
            value = {
              type = 1,
              data = {
                flag = true,
                byte = 1 '\001',
                string = 0x1 <error: Cannot access memory at address 0x1>,
                size = 1,
                pid = 1,
                integer = 1,
                int8 = 1 '\001',
                int16 = 1,
                int32 = 1,
                int64 = 1,
                uint = 1,
                uint8 = 1 '\001',
                uint16 = 1,
                uint32 = 1,
                uint64 = 1,
                fval = 1.40129846e-45,
                dval = 4.9406564584124654e-324,
                tv = {
                  tv_sec = 1,
                  tv_usec = 0
                },
                time = 1,
                status = 1,
                rank = 1,
                nspace = 0x1,
                proc = 0x1,
                bo = {
                  bytes = 0x1 <error: Cannot access memory at address 0x1>,
                  size = 0
                },
                persist = 1 '\001',
                scope = 1 '\001',
                range = 1 '\001',
                state = 1 '\001',
                pinfo = 0x1,
                darray = 0x1,
                ptr = 0x1,
                adir = 1 '\001',
                rbdir = 1 '\001',
                envar = {
                  envar = 0x1 <error: Cannot access memory at address 0x1>,
                  value = 0x0,
                  separator = 0 '\000'
                },
                coord = 0x1,
                linkstate = 1 '\001',
                jstate = 1 '\001',
                topo = 0x1,
                cpuset = 0x1,
                locality = 1,
                geometry = 0x1,
                devtype = 1,
                device = 0x1,
                devdist = 0x1,
                endpoint = 0x1,
                dbuf = 0x1,
                resunit = 0x1,
                nodepid = 0x1
              }
            }
          }, {
            key = '\000' <repeats 511 times>,
            flags = 0,
            value = {
              type = 0,
              data = {
                flag = false,
                byte = 0 '\000',
                string = 0x0,
                size = 0,
                pid = 0,
                integer = 0,
                int8 = 0 '\000',
                int16 = 0,
                int32 = 0,
                int64 = 0,
                uint = 0,
                uint8 = 0 '\000',
                uint16 = 0,
                uint32 = 0,
                uint64 = 0,
                fval = 0,
                dval = 0,
                tv = {
                  tv_sec = 0,
                  tv_usec = 0
                },
                time = 0,
                status = 0,
                rank = 0,
                nspace = 0x0,
                proc = 0x0,
                bo = {
                  bytes = 0x0,
                  size = 0
                },
                persist = 0 '\000',
                scope = 0 '\000',
                range = 0 '\000',
                state = 0 '\000',
                pinfo = 0x0,
                darray = 0x0,
                ptr = 0x0,
                adir = 0 '\000',
                rbdir = 0 '\000',
                envar = {
                  envar = 0x0,
                  value = 0x0,
                  separator = 0 '\000'
                },
                coord = 0x0,
                linkstate = 0 '\000',
                jstate = 0 '\000',
                topo = 0x0,
                cpuset = 0x0,
                locality = 0,
                geometry = 0x0,
                devtype = 0,
                device = 0x0,
                devdist = 0x0,
                endpoint = 0x0,
                dbuf = 0x0,
                resunit = 0x0,
                nodepid = 0x0
              }
            }
          }}
        rc = 0
        expected = 0
        desired = 1
        old_event_flags = 0
#10 0x00007ffff7a68de8 in PMPI_Init (argc=0x7fffffff2cbc, argv=0x7fffffff2cb0) at init_generated.c:64
64	        err = ompi_mpi_init(*argc, *argv, required, &provided, false);
        err = 32767
        provided = 0
        required = 0
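
To make the crash in frame #0 concrete (note pml_comm = 0x0 above): the communicator object exists, but its per-communicator PML state has not been installed yet, so the num_procs check dereferences NULL. Below is a small self-contained model of that hazard; the struct and function names are hypothetical stand-ins, not the real mca_pml_ob1_peer_lookup or the OMPI structs.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the per-communicator PML state and the
 * communicator object. */
struct pml_comm_state { unsigned num_procs; };
struct communicator   { struct pml_comm_state *pml_state; };

static void peer_lookup(const struct communicator *comm, int rank)
{
    struct pml_comm_state *pml_comm = comm->pml_state;
    if (NULL == pml_comm) {
        /* This is the situation in frame #0: PML state is installed only by
         * add_comm, which runs after the fence, so an early fragment finds
         * a NULL pointer here and the real code segfaults on the check below. */
        printf("early fragment for rank %d: comm not yet added to the PML\n", rank);
        return;
    }
    if (rank >= (int) pml_comm->num_procs) {
        printf("rank %d is out of range\n", rank);
    }
}

int main(void)
{
    struct communicator world = { .pml_state = NULL };  /* before add_comm */
    peer_lookup(&world, 3072);
    return 0;
}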

@bosilca
Member

bosilca commented Dec 9, 2025

That code has two paths: one for the async fence (where we do a single fence that starts before we create the PML comm associated with the world and completes after), and one for the synchronous case (where we do two fences, one before and one after associating the world communicator). The code linked as part of this issue specifically pinpoints the synchronous branch, so we are talking about the code path where we do two fences (one before and one after the world communicator is properly populated).

So, how does a process send a message to a peer before having successfully completed the second fence, at which point we know that everyone has properly associated the PML communicator with the world? The only way this race condition can happen is if either we are not on the code path described in the issue (i.e., the assumption of the issue is incorrect) or the alterations to make PRRTE resilient broke the fence somehow.
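
As a rough, condensed model of the two paths described above (the stub names modex_fence, add_world_to_pml, and final_fence are hypothetical; this paraphrases the control flow discussed in this thread, not the actual ompi_mpi_init source):

#include <stdbool.h>
#include <stdio.h>

static void modex_fence(void)      { puts("fence #1: modex / peer info exchange"); }
static void add_world_to_pml(void) { puts("add_comm(MPI_COMM_WORLD) in the PML"); }
static void final_fence(void)      { puts("fence #2: wait until every peer has set up the PML"); }

static void model_init(bool async_modex, bool async_mpi_init)
{
    if (!async_modex) {
        modex_fence();          /* blocking: peer info known before PML setup */
    }                           /* else: a single background fence is started instead */

    add_world_to_pml();         /* MPI_COMM_WORLD gets its PML state here */

    if (!async_mpi_init) {
        final_fence();          /* nobody should send before everyone reaches this point */
    }                           /* else: no second fence, so early messages become possible */
}

int main(void)
{
    model_init(false, false);   /* the synchronous, two-fence path */
    return 0;
}

On the synchronous path modeled here, an early receive like the one in the backtrace should not be possible unless one of the fences is not actually acting as a fence.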

@Matthew-Whitlock
Contributor Author

It looks like the fence I'm describing is the second fence. The first one is here:

OMPI_LAZY_WAIT_FOR_COMPLETION(active);

Called during init earlier in the function here:

ret = ompi_mpi_instance_init (*provided, &ompi_mpi_info_null.info.super, MPI_ERRORS_ARE_FATAL, &ompi_mpi_instance_default, argc, argv);

Which means there isn't a subsequent fence to block communication until after the pml communicator is added.

@Matthew-Whitlock
Contributor Author

Nevermind, I'm blind

@Matthew-Whitlock
Contributor Author

Though, the code path I'm describing implies opal_pmix_base_async_modex = false, which is different from the ompi_async_mpi_init = false that leads to the fence.

@hppritcha
Member

If you use unmodified prrte, do you observe the behavior that motivated this PR?

@rhc54
Contributor

rhc54 commented Dec 9, 2025

Though, the code path I'm describing implies opal_pmix_base_async_modex = false, which is different from the ompi_async_mpi_init = false that leads to the fence

Correct, the two flags are independent. FWIW: we added the async_mpi_init flag some time ago when several of the networking companies were exploring the notion of "instant on". Since we pre-defined all contact info in the runtime, there was no need for the initial fence (hence async_modex). We found we could also eliminate the last fence if we modified the firmware/library to deal with the "unexpected message", i.e., when I receive a message from someone before having defined an endpoint for them.

So the "async_mpi_init" flag was created to support that research. It was left because some folks felt that (a) it was off by default and therefore did no harm, (b) some of the library backends might be able to take advantage of it, and (c) some network providers might still have interest in it. If any of those are no longer true (and it has been awhile), then you are welcome to remove it. I can explain the results of the research separately if anyone is interested.

@Matthew-Whitlock
Contributor Author

I'm not seeing this on a clean build against main with internal dependencies. Sorry for the false alarm!

@hppritcha
Member

I would encourage you to test using an external pmix and prrte (both at the head of their respective masters) and verify that you don't see this behavior with those versions.
