
Conversation

@duaneg
Contributor

@duaneg duaneg commented Mar 19, 2025

This bug is caused by race conditions in the poll implementations (which are called by join/wait) where if multiple threads try to reap the dead process only one "wins" and gets the exit code, while the others get an error.

In the forkserver implementation the losing thread(s) set the code to an error, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.

In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the code if it wins or waits for a potentially racing thread to do so otherwise.

If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race, and return None instead of the exit code, even though the process is dead. However, as the process could be alive at the time the call is made but die immediately afterwards, this situation should already be handled by correctly written code.

To verify the behaviour, a test is added that reliably triggers failures for all three implementations. A work-around for this bug in a test added for gh-128041 is also reverted.

@duaneg duaneg requested a review from gpshead as a code owner March 19, 2025 02:06
@ghost

ghost commented Mar 19, 2025

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app

bedevere-app bot commented Mar 19, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

returning now, except if we raced with another thread that set it just after our timeout expired.
@duaneg
Contributor Author

duaneg commented Nov 3, 2025

It might be a good idea to open a PR with just the fix for forkserver under #140867. That issue describes the problem with forkserver better, and the fix for forkserver is much simpler and safer than the fix for fork/spawn. It makes sense to treat them separately.

@zmedico if you want to do that, please go ahead, and feel free to copy the unit test from this PR if it is helpful 🙂


with self._exit_condition:
    self._exit_blockers -= 1
    if pid == self.pid:
Contributor

What do you think about adding the following test, just between the decrement and the pid test:

            if self.returncode is not None:
                return self.returncode

I suggest this change because, while the condition RLock was released, the self.returncode attribute could have been updated.

Contributor Author

I don't think so, but it is very possible I've missed something, this is all (way too) subtle...

If the thread has won the race and os.waitpid returned the exit code then the returncode cannot have been set: only one thread will get a valid exit code, so if this one has it no other one can. In that case the if would always fail.

If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.

I.e. consider the following scenario:

  1. Threads 1 & 2 are blocked doing an os.waitpid in poll.
  2. Thread 3 does a non-blocking poll which "wins" and gets the status code (but has not yet set it!)
  3. Thread 1 wakes up. It has lost the race, returncode hasn't yet been set, and there is another blocker, so it goes into the while loop and waits.
  4. Thread 3 sets the returncode.
  5. Thread 2 wakes up. If we add the if it will exit without notifying thread 1, which will then be blocked indefinitely.

Contributor

@YvesDup YvesDup Nov 27, 2025

If the thread has won the race and os.waitpid returned the exit code then the returncode cannot have been set:

Are you describing the blocking poll case?
In the non-blocking poll case, the self._set_returncode method is always called in a protected block via the condition RLock. The set and notify operations are always executed together in the self._set_returncode method.

only one thread will get a valid exit code, so if this one has it no other one can.

I agree

In that case the if would always fail.

Yes, it would fail, but only for the current thread, which would continue to run. And subsequent threads should not fail if self.returncode is set correctly.

If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.

IMO the notification was already sent by the last call to the self._set_returncode method. Even though I have my doubts about whether a new notification is really necessary - when self._exit_blockers is zero? - the new if is no longer a relevant option.

2. Thread 3 does a non-blocking poll which "wins" and gets the status code (but has not yet set it!)

Regarding my first comment, I guess you are talking about a blocking poll.

When I ran your test_racing_joins test with the new if, all seemed okay: I never noticed a problem with missing notifications. We can check together if you wish.

Your fix seems okay to me. I was just wondering if it was possible to avoid executing the second protected block entirely.
Thank you for your time.

Contributor Author

If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.

IMO the notification was already sent by the last call to the self._set_returncode method. Even though I have my doubts about whether a new notification is really necessary - when self._exit_blockers is zero? - the new if is no longer a relevant option.

You make a good point, and I made a mistake in my analysis above. The early exit you suggest is indeed safe if another racing poll has won and set the return code, since they will have also notified. The problematic case is the pathological one, where some other code entirely called os.waitpid and won the race. In that case _set_returncode is not called and hence we need that backup notification to avoid a hang.

In that scenario the Popen object is broken, since it has no way to get the child's exit code. Methods like is_alive() and mp mechanisms downstream of Popen will give the wrong result (there are a bunch of open issues that boil down to this). It is a "user error" situation, but we should still try and make everything fail as gracefully as possible (i.e. not hang 😉).

So, we could put the early exit in. Ultimately I think the code will behave correctly with or without it. Personally, I think the code reads slightly simpler without it, just in basic terms of fewer conditionals/less cyclomatic complexity, but YMMV on that. If there is a general consensus otherwise I'll happily add it.

When I ran your test_racing_joins test with the new if, all seems okay, I never noticed problem with missing notifications. We can check together if you wish.

Thanks, I would love to collaborate on this! The existing test case does not test the pathological behaviour, so it will indeed miss this. Here is a new test script that does; it hangs reliably for me if the last blocker notification is removed:

import multiprocessing as mp
import os
import threading

N = 8

def wait(barrier, p):
    barrier.wait()
    p.join()

    # WRONG due to losing the child's status
    assert p.exitcode is None
    assert p.is_alive()

def race():
    pbarrier = mp.Barrier(2)

    # Ensure child has completed
    p = mp.Process(target=pbarrier.wait)
    p.start()
    pbarrier.wait()

    # Steal the child's exit code
    pid, sts = os.waitpid(p.pid, 0)
    assert pid == p.pid
    assert sts == 0

    # Since the child is dead, these should not block...
    tbarrier = threading.Barrier(N)
    threads = [threading.Thread(target=wait, args=(tbarrier, p)) for _ in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    mp.set_start_method('fork')
    race()

Does that reproduce for you? I guess it should be tidied up and turned into a unit test...

Your fix seems okay to me. I was just wondering if it was possible to avoid executing the second protected block entirely. Thank you for your time.

Thanks! For the record, I very much appreciate your review, and am grateful for your feedback and intelligent questions. I really dislike the sort of overly-complex synchronisation I'm doing here, so I'm particularly grateful for it in this case. I would love to simplify this!


def __init__(self, process_obj):
    self._fds = []
    self._lock = threading.Lock()
Contributor

@YvesDup YvesDup Nov 28, 2025

What do we do here about the private _exit_condition and _exit_blockers attributes inherited from the popen_fork.Popen class? I suggest a private method in the parent class which defines its own synchronization attributes, called from __init__.

Contributor Author

Good question. I had intended to just ignore them. We are very fortunate the locking is entirely self-contained within each class's poll implementation, so this works just fine, but it does leave the child with unused instance variables inherited from the parent.

If I understand your suggestion (please correct me if not), you are suggesting a polymorphic method to initialise locking, so each class only creates the attributes it requires. I agree, that sounds like an improvement, thanks! I'll update the patch shortly to add it.

BTW, I think all this would be simpler and more robust if popen_forkserver.Popen did not directly inherit from popen_fork.Popen. They should probably both inherit from an abstract base class, instead. Presumably it would be impractical, if not impossible, to change that now due to backward compat concerns, though.

…o each class can do so as it requires.

Co-authored-by: Duprat <yduprat@gmail.com>