gh-130895: fix multiprocessing.Process join/wait/poll races #131440
Conversation
This bug is caused by race conditions in the poll implementations (which are called by join/wait) where if multiple threads try to reap the dead process only one "wins" and gets the exit code, while the others get an error.

In the forkserver implementation the losing thread(s) set the code to an error, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.

In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the code if it wins or waits for a potentially racing thread to do so otherwise.

If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race, and return None instead of the exit code, even though the process is dead. However, as the process could be alive at the time the call is made but die immediately afterwards, this situation should already be handled by correctly written code.

To verify the behaviour a test is added which reliably triggers failures for all three implementations. A work-around for this bug in a test added for gh-128041 is also reverted.
returning now, except if we raced with another thread that set it just after our timeout expired.
It might be a good idea to open a PR with just the fix; @zmedico, if you want to do that, please go ahead, and feel free to copy the unit test from this PR if it is helpful 🙂
```python
with self._exit_condition:
    self._exit_blockers -= 1
    if pid == self.pid:
```
What do you think about adding the following test, just between the increment and the pid test:

```python
if self.returncode is not None:
    return self.returncode
```
I suggest this change because, while the condition's RLock was released, the `self.returncode` attribute could have been updated.
I don't think so, but it is very possible I've missed something, this is all (way too) subtle...
If the thread has won the race and `os.waitpid` returned the exit code then the `returncode` cannot have been set: only one thread will get a valid exit code, so if this one has it no other one can. In that case the `if` would always fail.
If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.
I.e. consider the following scenario:
1. Threads 1 & 2 are blocked doing an `os.waitpid` in `poll`.
2. Thread 3 does a non-blocking `poll` which "wins" and gets the status code (but has not yet set it!)
3. Thread 1 wakes up. It has lost the race, `returncode` hasn't yet been set, and there is another blocker, so it goes into the `while` loop and waits.
4. Thread 3 sets the `returncode`.
5. Thread 2 wakes up. If we add the `if` it will exit without notifying thread 1, which will then be blocked indefinitely.
If the thread has won the race and `os.waitpid` returned the exit code then the `returncode` cannot have been set:
Are you describing a case of blocking poll?
In the case of a non-blocking poll, the `self._set_returncode` method is always called in a protected block via the condition's RLock. The set and notify operations are always executed together in the `self._set_returncode` method.
only one thread will get a valid exit code, so if this one has it no other one can.
I agree
In that case the if would always fail.
Yes, it would fail, but only for the current thread, which will continue to run. And the following threads should not fail if `self.returncode` is set correctly.
If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.
IMO the notification was already sent from the last call to the `self._set_returncode` method. Even though I have my doubts about whether a new notification is really necessary - when `self._exit_blockers` is zero? - the new `if` is no longer a relevant option.
2. Thread 3 does a non-blocking `poll` which "wins" and gets the status code (but has not yet set it!)
Regarding my first comment, I guess you are talking about a blocking poll.
When I ran your `test_racing_joins` test with the new `if`, all seemed okay; I never noticed a problem with missing notifications. We can check together if you wish.
Your fix seems okay to me. I was just wondering if it was possible to avoid executing the second protected block entirely.
Thank you for your time.
If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.
IMO the notification was already sent from the last call to the `self._set_returncode` method. Even though I have my doubts about whether a new notification is really necessary - when `self._exit_blockers` is zero? - the new `if` is no longer a relevant option.
You make a good point, and I made a mistake in my analysis above. The early exit you suggest is indeed safe if another racing poll has won and set the return code, since they will have also notified. The problematic case is the pathological one, where some other code entirely called os.waitpid and won the race. In that case _set_returncode is not called and hence we need that backup notification to avoid a hang.
In that scenario the Popen object is broken, since it has no way to get the child's exit code. Methods like `is_alive()` and multiprocessing mechanisms downstream of Popen will give the wrong result (there are a bunch of open issues that boil down to this). It is a "user error" situation, but we should still try and make everything fail as gracefully as possible (i.e. not hang 😉).
So, we could put the early exit in. Ultimately I think the code will behave correctly with or without it. Personally, I think the code reads slightly simpler without it, just in basic terms of fewer conditionals/less cyclomatic complexity, but YMMV on that. If there is a general consensus otherwise I'll happily add it.
When I ran your `test_racing_joins` test with the new `if`, all seemed okay; I never noticed a problem with missing notifications. We can check together if you wish.
Thanks, I would love to collaborate on this! The existing test case does not test the pathological behaviour, so it will indeed miss this. Here is a new test script that does; it hangs reliably for me if the last blocker notification is removed:
```python
import multiprocessing as mp
import os
import threading

N = 8

def wait(barrier, p):
    barrier.wait()
    p.join()
    # WRONG due to losing the child's status
    assert p.exitcode is None
    assert p.is_alive()

def race():
    pbarrier = mp.Barrier(2)
    p = mp.Process(target=pbarrier.wait)
    p.start()
    # Ensure child has completed
    pbarrier.wait()
    # Steal the child's exit code
    pid, sts = os.waitpid(p.pid, 0)
    assert pid == p.pid
    assert sts == 0
    # Since the child is dead, these should not block...
    tbarrier = threading.Barrier(N)
    threads = [threading.Thread(target=wait, args=(tbarrier, p))
               for _ in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    mp.set_start_method('fork')
    race()
```

Does that reproduce for you? I guess it should be tidied up and turned into a unit test...
Your fix seems okay to me. I was just wondering if it was possible to avoid executing the second protected block entirely. Thank you for your time.
Thanks! For the record, I very much appreciate your review, and am grateful for your feedback and intelligent questions. I really dislike the sort of overly-complex synchronisation I'm doing here, so I'm particularly grateful for it in this case. I would love to simplify this!
```python
def __init__(self, process_obj):
    self._fds = []
    self._lock = threading.Lock()
```
What do we do here about the private `_exit_condition` and `_exit_blockers` attributes inherited from the `popen_fork.Popen` class? I suggest a private method in the parent class which defines its own synchronization attributes, called from `__init__`.
Good question. I had intended to just ignore them. We are very fortunate the locking is entirely self-contained within each class's poll implementation, so this works just fine, but it does leave the child with unused instance variables inherited from the parent.
If I understand your suggestion (please correct me if not), you are suggesting a polymorphic method to initialise locking, so each class only creates the attributes it requires. I agree, that sounds like an improvement, thanks! I'll update the patch shortly to add it.
BTW, I think all this would be simpler and more robust if popen_forkserver.Popen did not directly inherit from popen_fork.Popen. They should probably both inherit from an abstract base class, instead. Presumably it would be impractical, if not impossible, to change that now due to backward compat concerns, though.
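If I've understood the suggestion correctly, it would look something like this - a minimal sketch with illustrative names and a simplified `__init__`, not the actual patch:

```python
import threading

class PopenFork:
    """Parent class: creates only the synchronization attributes it
    uses, via an overridable hook (sketch; real __init__ does more)."""

    def __init__(self):
        self._init_sync()

    def _init_sync(self):
        # fork/spawn variants need a condition plus a blocker count
        self._exit_condition = threading.Condition()
        self._exit_blockers = 0

class PopenForkserver(PopenFork):
    def _init_sync(self):
        # forkserver only needs a plain lock around its blocking wait
        self._lock = threading.Lock()
```

This way the child class no longer carries unused locking attributes inherited from the parent.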
…o each class can do so as it requires. Co-authored-by: Duprat <yduprat@gmail.com>
`multiprocessing.Process.is_alive()` can incorrectly return True after `join()` #130895