Conversation

@Matthew-Whitlock
Contributor

It is possible for the ob1 pml to receive a message on MPI_COMM_WORLD while waiting on a PMIx fence here:

OMPI_LAZY_WAIT_FOR_COMPLETION(active);

This leads to a segfault, since MPI_COMM_WORLD is not set up in the pml until after the fence:

MCA_PML_CALL(add_comm(&ompi_mpi_comm_world.comm));

This change adds the early messages to the non_existing_communicator_pending list, which is then checked when MPI_COMM_WORLD is added to the pml.

Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
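
For illustration, here is a minimal, self-contained sketch of the pending-list pattern described above: fragments that arrive before a communicator has PML state are stashed, then replayed when the communicator is added. The names (pending_frag_t, stash_early_fragment, replay_pending_for_comm) are hypothetical stand-ins, not the actual ob1 code or this patch; in the real change, as described above, ob1's existing non_existing_communicator_pending list plays the role of pending_list.

#include <stdlib.h>

/* Hypothetical model of an "early fragments" list. */
typedef struct pending_frag {
    struct pending_frag *next;
    int comm_index;   /* identifier of the communicator the fragment arrived on */
    void *payload;    /* opaque saved copy of the fragment data */
} pending_frag_t;

static pending_frag_t *pending_list = NULL;

/* Receive path: the target communicator has no PML state yet (e.g.
 * MPI_COMM_WORLD before add_comm has run), so stash the fragment. */
static void stash_early_fragment(int comm_index, void *payload)
{
    pending_frag_t *f = malloc(sizeof(*f));
    if (NULL == f) {
        return;
    }
    f->comm_index = comm_index;
    f->payload    = payload;
    f->next       = pending_list;
    pending_list  = f;
}

/* add_comm path: the communicator's PML state now exists, so replay any
 * fragments that arrived before it was added. */
static void replay_pending_for_comm(int comm_index, void (*deliver)(void *payload))
{
    pending_frag_t **p = &pending_list;
    while (NULL != *p) {
        if ((*p)->comm_index == comm_index) {
            pending_frag_t *f = *p;
            *p = f->next;
            deliver(f->payload);
            free(f);
        } else {
            p = &(*p)->next;
        }
    }
}
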
@hppritcha hppritcha self-requested a review December 9, 2025 17:17
@bosilca
Member

bosilca commented Dec 9, 2025

That fence is the modex fence, which means that before it there is no knowledge about the peers, communicators, or anything else related to data movement. In other words, no processes have been added to the PML/BML/BTL, so there are no endpoints, so how is the data received? I don't think this is right.

@hppritcha
Member

I could see that under some conditions one or more processes may exit the non-blocking fence sufficiently far ahead of the others that they could send a message before the intended receiver had added the self/world communicators.
@Matthew-Whitlock under what conditions are you observing this?

@bosilca
Member

bosilca commented Dec 9, 2025

Assuming that is true, how come we haven't seen such a critical issue in the last 10 years at any scale?

@bosilca
Member

bosilca left a comment


We need to hold off on this until we understand what exactly the issue is here.

@rhc54
Contributor

rhc54 commented Dec 9, 2025

I always thought that was what the final fence at the end of mpi_init was for: to ensure that everyone completed add_procs before we enabled communication?

@hppritcha
Member

If mpi_async_mpi_init is true then there isn't a second pmix_fence_nb executed. That's part of the reason I asked @Matthew-Whitlock to give more info here about when he's hitting this problem.

@Matthew-Whitlock
Contributor Author

I'm running against a version of PRRTE modified for fault tolerance, which could change some timing assumptions. I'd imagine (hope) that it doesn't add any significant timing delays, but that could be a possible source of an issue. I haven't tested with a totally clean install, but I'll do that now since y'all are saying this sounds unusual. Here's more of my current debug info, though.

My MPI parameters: -n 3300 --map-by ppr:100:node --output merge-stderr,file=.prints --mca btl tcp,sm,self --with-ft=ulfm --enable-recovery --prtemca prte_keepalive_time 30 --prtemca prte_keepalive_probes 4 --prtemca prte_keepalive_intvl 1 --mca plm slurm --mca coll ^han --mca smsc ^xpmem

And here's the GDB backtrace that pointed me towards this as the issue:

Thread 1 "lmp" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d14458 in mca_pml_ob1_peer_lookup (comm=0xd68d20 <ompi_mpi_comm_world>, rank=3072) at <...>/ompi/mca/pml/ob1/pml_ob1_comm.h:99
99	    if( OPAL_UNLIKELY(rank >= (int)pml_comm->num_procs) ) {

#0  0x00007ffff7d14458 in mca_pml_ob1_peer_lookup (comm=0xd68d20 <ompi_mpi_comm_world>, rank=3072) at <...>/ompi/mca/pml/ob1/pml_ob1_comm.h:99
99	    if( OPAL_UNLIKELY(rank >= (int)pml_comm->num_procs) ) {
        pml_comm = 0x0
#1  0x00007ffff7d1683a in mca_pml_ob1_recv_frag_callback_match (btl=0xf993d0, descriptor=0x7fffffff2600) at <...>/ompi/mca/pml/ob1/pml_ob1_recvfrag.c:489
489	    proc = mca_pml_ob1_peer_lookup (comm_ptr, hdr->hdr_src);
        segments = 0x7ffff015eb88
        hdr = 0x7ffff015ec50
        comm_ptr = 0xd68d20 <ompi_mpi_comm_world>
        match = 0x0
        comm = 0x0
        proc = 0x0
        num_segments = 1
        bytes_received = 0
        __PRETTY_FUNCTION__ = "mca_pml_ob1_rec"...
#2  0x00007ffff72b4a89 in mca_btl_tcp_endpoint_recv_handler (sd=15, flags=2, user=0x100aa10) at <...>/opal/mca/btl/tcp/btl_tcp_endpoint.c:1044
1044	                reg->cbfunc(&frag->btl->super, &desc);
        reg = 0x7ffff7358950 <mca_btl_base_active_message_trigger+1040>
        desc = {
          endpoint = 0x100aa10,
          des_segments = 0x7ffff015eb88,
          des_segment_count = 1,
          tag = 65 'A',
          cbdata = 0x0
        }
        frag = 0x7ffff015eb00
        btl_endpoint = 0x100aa10
        __PRETTY_FUNCTION__ = "mca_btl_tcp_end"...
        __func__ = "mca_btl_tcp_end"...
#3  0x00007ffff6a27485 in event_persist_closure (ev=<optimized out>, base=0xdc97e0) at event.c:1623
1623	        (evcb_callback)(evcb_fd, evcb_res, evcb_arg);
        evcb_callback = <optimized out>
        evcb_fd = <optimized out>
        evcb_res = 2
        evcb_arg = 0x100aa10
        evcb_callback = <optimized out>
        evcb_fd = <optimized out>
        evcb_res = <optimized out>
        evcb_arg = <optimized out>
        __func__ = "event_persist_c"...
        run_at = <optimized out>
        relative_to = <optimized out>
        delay = <optimized out>
        now = <optimized out>
        usec_mask = <optimized out>
#4  event_process_active_single_queue (base=base@entry=0xdc97e0, activeq=0xdc9df0, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0) at event.c:1682
1682				event_persist_closure(base, ev);
        ev = <optimized out>
        evcb = <optimized out>
        count = 1
        __func__ = "event_process_a"...
#5  0x00007ffff6a27b0f in event_process_active (base=0xdc97e0) at event.c:1783
1783					c = event_process_active_single_queue(base, activeq,
        activeq = <optimized out>
        i = 4
        c = 0
        tv = {
          tv_sec = 16061063,
          tv_usec = 140737340837896
        }
        maxcb = 2147483647
        endtime = 0x0
        limit_after_prio = 2147483647
        activeq = <optimized out>
        i = <optimized out>
        c = <optimized out>
        endtime = <optimized out>
        tv = <optimized out>
        maxcb = <optimized out>
        limit_after_prio = <optimized out>
        done = <optimized out>
#6  event_base_loop (base=0xdc97e0, flags=2) at event.c:2006
2006				int n = event_process_active(base);
        n = <optimized out>
        evsel = 0x7ffff6a3eb80 <epollops>
        tv = {
          tv_sec = 0,
          tv_usec = 0
        }
        tv_p = <optimized out>
        res = <optimized out>
        done = 0
        retval = 0
        __func__ = "event_base_loop"
#7  0x00007ffff7210d40 in opal_progress_events () at <...>/opal/runtime/opal_progress.c:188
188	            events += opal_event_loop(opal_sync_event_base, opal_progress_event_flag);
        now = 30517406652018573
        lock = 1
        events = 0
#8  0x00007ffff7210df9 in opal_progress () at <...>/opal/runtime/opal_progress.c:240
240	        opal_progress_events();
        num_calls = 339
        i = 2
        events = 0
#9  0x00007ffff79e6e50 in ompi_mpi_init (argc=34, argv=0x7fffffff2e38, requested=0, provided=0x7fffffff2c98, reinit_ok=false) at <...>/ompi/runtime/ompi_mpi_init.c:490
490	            OMPI_LAZY_WAIT_FOR_COMPLETION(active);
        ret = 0
        error = 0x0
        evar = 0x0
        active = false
        background_fence = false
        info = {{
            key = "pmix.collect", '\000' <repeats 499 times>,
            flags = 0,
            value = {
              type = 1,
              data = {
                flag = true,
                byte = 1 '\001',
                string = 0x1 <error: Cannot access memory at address 0x1>,
                size = 1,
                pid = 1,
                integer = 1,
                int8 = 1 '\001',
                int16 = 1,
                int32 = 1,
                int64 = 1,
                uint = 1,
                uint8 = 1 '\001',
                uint16 = 1,
                uint32 = 1,
                uint64 = 1,
                fval = 1.40129846e-45,
                dval = 4.9406564584124654e-324,
                tv = {
                  tv_sec = 1,
                  tv_usec = 0
                },
                time = 1,
                status = 1,
                rank = 1,
                nspace = 0x1,
                proc = 0x1,
                bo = {
                  bytes = 0x1 <error: Cannot access memory at address 0x1>,
                  size = 0
                },
                persist = 1 '\001',
                scope = 1 '\001',
                range = 1 '\001',
                state = 1 '\001',
                pinfo = 0x1,
                darray = 0x1,
                ptr = 0x1,
                adir = 1 '\001',
                rbdir = 1 '\001',
                envar = {
                  envar = 0x1 <error: Cannot access memory at address 0x1>,
                  value = 0x0,
                  separator = 0 '\000'
                },
                coord = 0x1,
                linkstate = 1 '\001',
                jstate = 1 '\001',
                topo = 0x1,
                cpuset = 0x1,
                locality = 1,
                geometry = 0x1,
                devtype = 1,
                device = 0x1,
                devdist = 0x1,
                endpoint = 0x1,
                dbuf = 0x1,
                resunit = 0x1,
                nodepid = 0x1
              }
            }
          }, {
            key = '\000' <repeats 511 times>,
            flags = 0,
            value = {
              type = 0,
              data = {
                flag = false,
                byte = 0 '\000',
                string = 0x0,
                size = 0,
                pid = 0,
                integer = 0,
                int8 = 0 '\000',
                int16 = 0,
                int32 = 0,
                int64 = 0,
                uint = 0,
                uint8 = 0 '\000',
                uint16 = 0,
                uint32 = 0,
                uint64 = 0,
                fval = 0,
                dval = 0,
                tv = {
                  tv_sec = 0,
                  tv_usec = 0
                },
                time = 0,
                status = 0,
                rank = 0,
                nspace = 0x0,
                proc = 0x0,
                bo = {
                  bytes = 0x0,
                  size = 0
                },
                persist = 0 '\000',
                scope = 0 '\000',
                range = 0 '\000',
                state = 0 '\000',
                pinfo = 0x0,
                darray = 0x0,
                ptr = 0x0,
                adir = 0 '\000',
                rbdir = 0 '\000',
                envar = {
                  envar = 0x0,
                  value = 0x0,
                  separator = 0 '\000'
                },
                coord = 0x0,
                linkstate = 0 '\000',
                jstate = 0 '\000',
                topo = 0x0,
                cpuset = 0x0,
                locality = 0,
                geometry = 0x0,
                devtype = 0,
                device = 0x0,
                devdist = 0x0,
                endpoint = 0x0,
                dbuf = 0x0,
                resunit = 0x0,
                nodepid = 0x0
              }
            }
          }}
        rc = 0
        expected = 0
        desired = 1
        old_event_flags = 0
#10 0x00007ffff7a68de8 in PMPI_Init (argc=0x7fffffff2cbc, argv=0x7fffffff2cb0) at init_generated.c:64
64	        err = ompi_mpi_init(*argc, *argv, required, &provided, false);
        err = 32767
        provided = 0
        required = 0
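
To make the crash in frame #0 concrete (note pml_comm = 0x0 above): the communicator object exists, but its per-communicator PML state has not been installed yet, so the num_procs check dereferences NULL. Below is a small self-contained model of that hazard; the struct and function names are hypothetical stand-ins, not the real mca_pml_ob1_peer_lookup or the OMPI structs.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the per-communicator PML state and the
 * communicator object. */
struct pml_comm_state { unsigned num_procs; };
struct communicator   { struct pml_comm_state *pml_state; };

static void peer_lookup(const struct communicator *comm, int rank)
{
    struct pml_comm_state *pml_comm = comm->pml_state;
    if (NULL == pml_comm) {
        /* This is the situation in frame #0: PML state is installed only by
         * add_comm, which runs after the fence, so an early fragment finds
         * a NULL pointer here and the real code segfaults on the check below. */
        printf("early fragment for rank %d: comm not yet added to the PML\n", rank);
        return;
    }
    if (rank >= (int) pml_comm->num_procs) {
        printf("rank %d is out of range\n", rank);
    }
}

int main(void)
{
    struct communicator world = { .pml_state = NULL };  /* before add_comm */
    peer_lookup(&world, 3072);
    return 0;
}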

@bosilca
Member

bosilca commented Dec 9, 2025

That code has two paths: one for the async fence (where we do a single fence that starts before we create the PML comm associated with the world and completes after), and one for the synchronous case (where we do two fences, one before and one after associating the world communicator). The code linked as part of this issue specifically pinpoints the synchronous branch, so we are talking about the code path where we do two fences (one before and one after the world communicator is properly populated).

So, how does a process send a message to a peer before having successfully completed the second fence, at which point we know that everyone has properly associated the PML communicator with the world? The only way this race condition can happen is if either we are not on the code path described in the issue (i.e., the assumption of the issue is incorrect) or the alterations to make PRRTE resilient broke the fence somehow.
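
As a rough, condensed model of the two paths described above (the stub names modex_fence, add_world_to_pml, and final_fence are hypothetical; this paraphrases the control flow discussed in this thread, not the actual ompi_mpi_init source):

#include <stdbool.h>
#include <stdio.h>

static void modex_fence(void)      { puts("fence #1: modex / peer info exchange"); }
static void add_world_to_pml(void) { puts("add_comm(MPI_COMM_WORLD) in the PML"); }
static void final_fence(void)      { puts("fence #2: wait until every peer has set up the PML"); }

static void model_init(bool async_modex, bool async_mpi_init)
{
    if (!async_modex) {
        modex_fence();          /* blocking: peer info known before PML setup */
    }                           /* else: a single background fence is started instead */

    add_world_to_pml();         /* MPI_COMM_WORLD gets its PML state here */

    if (!async_mpi_init) {
        final_fence();          /* nobody should send before everyone reaches this point */
    }                           /* else: no second fence, so early messages become possible */
}

int main(void)
{
    model_init(false, false);   /* the synchronous, two-fence path */
    return 0;
}

On the synchronous path modeled here, an early receive like the one in the backtrace should not be possible unless one of the fences is not actually acting as a fence.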

@Matthew-Whitlock
Contributor Author

It looks like the fence I'm describing is the second fence. The first one is here:

OMPI_LAZY_WAIT_FOR_COMPLETION(active);

Called during init earlier in the function here:

ret = ompi_mpi_instance_init (*provided, &ompi_mpi_info_null.info.super, MPI_ERRORS_ARE_FATAL, &ompi_mpi_instance_default, argc, argv);

Which means there isn't a subsequent fence to block communication until after the pml communicator is added.

@Matthew-Whitlock
Contributor Author

Nevermind, I'm blind

@Matthew-Whitlock
Contributor Author

Though, the code path I'm describing implies opal_pmix_base_async_modex = false, which is different from the ompi_async_mpi_init = false that leads to the fence.

@hppritcha
Member

If you use unmodified prrte, do you observe the behavior that motivated this PR?

@rhc54
Contributor

rhc54 commented Dec 9, 2025

Though, the code path I'm describing implies opal_pmix_base_async_modex = false, which is different from the ompi_async_mpi_init = false that leads to the fence

Correct, the two flags are independent. FWIW: we added the async_mpi_init flag some time ago when several of the networking companies were exploring the notion of "instant on". Since we pre-defined all contact info in the runtime, there was no need for the initial fence (hence async_modex). We found we could also eliminate the last fence if we modified the firmware/library to deal with the "unexpected message", i.e., when I receive a message from someone before having defined an endpoint for them.

So the "async_mpi_init" flag was created to support that research. It was left because some folks felt that (a) it was off by default and therefore did no harm, (b) some of the library backends might be able to take advantage of it, and (c) some network providers might still have interest in it. If any of those are no longer true (and it has been awhile), then you are welcome to remove it. I can explain the results of the research separately if anyone is interested.

@Matthew-Whitlock
Contributor Author

I'm not seeing this on a clean build against main with internal dependencies. Sorry for the false alarm!

@hppritcha
Member

I would encourage you to test using an external pmix and prrte (both at the head of their respective masters) and verify that you don't see this behavior with those versions.
