pml/ob1 remove mpi_init race condition #13559
Conversation
Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
|
That fence is the modex fence, which means that before that fence there is no knowledge about the peers, the communicators, or anything else related to data movement. In other terms, no processes have been added to the PML/BML/BTL, so there are no endpoints, so how is the data received? I don't think this is right.
|
I could see that under some conditions one or more processes may exit the non-blocking fence sufficiently far ahead of the others that they could send a message before the intended receiver had added the self/world communicators.
|
Assuming that is true, how come we didn't see such a critical issue in the last 10 years, at any scale?
bosilca
left a comment
We need to hold off on this until we understand what exactly the issue is here.
|
I always thought that was what the final fence at the end of mpi_init was for - to ensure that everyone completed add_procs before we enabled communication?
|
if |
|
I'm running against a version of PRRTE modified for fault tolerance, which could change some timing assumptions. I'd imagine (hope) that it doesn't add any significant timing delays, but that could be a possible source of the issue. I haven't tested with a totally clean install, but I'll do that now, since y'all are saying this sounds unusual. Here's more of my current debug info, though.
My MPI parameters:
And here's the GDB backtrace that pointed me towards this as the issue:
|
That code has two paths, one for the async fence (where we do a single fence that starts before we create the PML comm associated with the world and completes after) and one for the synchronous case (where we do two fences, one before and one after associating the world communicator). The code linked as part of this issue specifically pinpoints the synchronous branch, so we are talking about the code path where we do two fences (one before and one after the world communicator is properly populated). So, how does a process send a message to a peer before having successfully completed the second fence, at which point we know that everyone has properly associated the PML communicator with the world? The only way this race condition can happen is if either we are not on the code path described in the issue (aka the assumption of the issue is incorrect) or the alterations to make PRRTE resilient broke the fence somehow.
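To make the ordering concrete, here is a minimal sketch of the synchronous path described above. The function names are illustrative placeholders, not the actual ompi_mpi_init.c or PMIx API; only the ordering of the fences relative to add_comm matters.

```c
/* Sketch only: placeholder names, not the real Open MPI internals. */
void pmix_fence_blocking(void);   /* stand-in for a blocking PMIx fence       */
void pml_add_comm(int comm);      /* stand-in for MCA_PML_CALL(add_comm(...)) */
#define WORLD 0
#define SELF  1

static void mpi_init_sync_path(void)
{
    /* Fence 1: every process has published its modex data and run
     * add_procs, so endpoints for all peers exist before anyone proceeds. */
    pmix_fence_blocking();

    /* Associate the predefined communicators with the PML. */
    pml_add_comm(WORLD);
    pml_add_comm(SELF);

    /* Fence 2: nobody returns from MPI_Init (and therefore nobody can
     * send) until every peer has registered the world communicator. */
    pmix_fence_blocking();
}
```

If that second fence is in place and honored, a message targeting the world communicator should never arrive at a process that has not yet run add_comm for it, which is why the report is surprising.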
|
It looks like the fence I'm describing is the second fence. The first one is here: Line 765 in 7c5a405
Called during init earlier in the function, here: ompi/ompi/runtime/ompi_mpi_init.c Line 397 in 7c5a405
Which means there isn't a subsequent fence to block communication until after adding the pml communicator.
|
Nevermind, I'm blind |
|
Though, the code path I'm describing implies |
|
If you use unmodified prrte, do you observe the behavior that motivated this PR?
Correct - the two flags are independent. FWIW: we added the async_mpi_init flag some time ago, when several of the networking companies were exploring the notion of "instant on". Since we pre-defined all contact info in the runtime, there was no need for the initial fence (hence async_modex). We found we could also eliminate the last fence if we modified the firmware/library to deal with the "unexpected message" - i.e., when I receive a message from someone before having defined an endpoint for them. So the "async_mpi_init" flag was created to support that research. It was left in because some folks felt that (a) it was off by default and therefore did no harm, (b) some of the library backends might be able to take advantage of it, and (c) some network providers might still have an interest in it. If any of those are no longer true (and it has been a while), then you are welcome to remove it. I can explain the results of the research separately if anyone is interested.
|
I'm not seeing this on a clean build against main with internal dependencies. Sorry for the false alarm!
|
I would encourage you to test using an external pmix and prrte (both at the head of their respective master branches) and verify you don't see this behavior with those versions.
It is possible for the ob1 pml to receive a message on MPI_COMM_WORLD while waiting on a PMIx fence here:
ompi/ompi/runtime/ompi_mpi_init.c
Line 490 in 7c5a405
This leads to a segfault, since MPI_COMM_WORLD is not set up in the pml until after the fence:
ompi/ompi/runtime/ompi_mpi_init.c
Line 496 in 7c5a405
This change adds the early messages to the non_existing_communicator_pending list, which is then checked when MPI_COMM_WORLD is added to the pml.
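For illustration, here is a minimal sketch of the idea behind the change as described above: a fragment that arrives for a communicator the PML does not yet know about is queued on a pending list instead of being matched against a missing communicator, and the queue is drained when the communicator is added. The structures and function names below are simplified placeholders, not the actual ob1 code; only the list name non_existing_communicator_pending is taken from the description.

```c
#include <stdlib.h>

/* Simplified placeholders for an incoming fragment and a pending queue,
 * loosely modeled on the non_existing_communicator_pending list. */
struct frag {
    int          cid;    /* communicator id the message targets */
    struct frag *next;
};

static struct frag *pending_head   = NULL;      /* early-arrival queue           */
static void        *comm_table[16] = { NULL };  /* cid -> communicator, if known */

static void deliver(void *comm, struct frag *f)
{
    /* Normal matching path would run here. */
    (void)comm;
    free(f);
}

/* Receive callback: if the target communicator is not set up yet (e.g.
 * MPI_COMM_WORLD before the fence completes), queue the fragment instead
 * of dereferencing a communicator that does not exist. */
static void recv_frag(struct frag *f)
{
    void *comm = comm_table[f->cid];
    if (NULL == comm) {
        f->next      = pending_head;
        pending_head = f;
        return;
    }
    deliver(comm, f);
}

/* add_comm: once the communicator is registered with the PML, replay any
 * fragments that arrived early for its cid. */
static void add_comm(int cid, void *comm)
{
    comm_table[cid] = comm;
    for (struct frag **p = &pending_head; *p != NULL; ) {
        if ((*p)->cid == cid) {
            struct frag *f = *p;
            *p = f->next;
            deliver(comm, f);
        } else {
            p = &(*p)->next;
        }
    }
}
```

Note that this sketch pushes onto the head of the list and therefore replays fragments in reverse arrival order; an actual implementation would need to preserve arrival order to keep MPI's message-ordering guarantees.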