
Conversation

@duaneg
Contributor

@duaneg duaneg commented Mar 19, 2025

This bug is caused by race conditions in the poll implementations (which are called by join/wait) where if multiple threads try to reap the dead process only one "wins" and gets the exit code, while the others get an error.

In the forkserver implementation the losing thread(s) set the code to an error, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.

In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the code if it wins or waits for a potentially racing thread to do so otherwise.

If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race, and return None instead of the exit code, even though the process is dead. However, as the process could be alive at the time the call is made but die immediately afterwards, this situation should already be handled by correctly written code.

To verify the behaviour, a test is added that reliably triggers failures for all three implementations. A work-around for this bug in a test added for gh-128041 is also reverted.

@duaneg duaneg requested a review from gpshead as a code owner March 19, 2025 02:06
@ghost

ghost commented Mar 19, 2025

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app

bedevere-app bot commented Mar 19, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

returning now, except if we raced with another thread that set it just after our timeout expired.
@duaneg
Contributor Author

duaneg commented Nov 3, 2025

It might be a good idea to open a PR with just the fix for forkserver under #140867. That issue describes the problem with forkserver better, and the fix for forkserver is much simpler and safer than the fix for fork/spawn. It makes sense to treat them separately.

@zmedico if you want to do that, please go ahead, and feel free to copy the unit test from this PR if it is helpful 🙂


with self._exit_condition:
    self._exit_blockers -= 1
    if pid == self.pid:
Contributor

What do you think about adding the following test, just between the decrement and the pid test:

            if self.returncode is not None:
                return self.returncode

I suggest this change because, while the condition RLock was released, the self.returncode attribute could have been updated.

Contributor Author

I don't think so, but it is very possible I've missed something, this is all (way too) subtle...

If the thread has won the race and os.waitpid returned the exit code then the returncode cannot have been set: only one thread will get a valid exit code, so if this one has it no other one can. In that case the if would always fail.

If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.

I.e. consider the following scenario:

  1. Threads 1 & 2 are blocked doing an os.waitpid in poll.
  2. Thread 3 does a non-blocking poll which "wins" and gets the status code (but has not yet set it!)
  3. Thread 1 wakes up. It has lost the race, returncode hasn't yet been set, and there is another blocker, so it goes into the while loop and waits.
  4. Thread 3 sets the returncode.
  5. Thread 2 wakes up. If we add the if it will exit without notifying thread 1, which will then be blocked indefinitely.

Contributor

@YvesDup YvesDup Nov 27, 2025

If the thread has won the race and os.waitpid returned the exit code then the returncode cannot have been set:

Are you describing the blocking poll case?
In the non-blocking poll case, the self._set_returncode method is always called in a protected block via the condition RLock. The set and notify operations are always executed together in the self._set_returncode method.

only one thread will get a valid exit code, so if this one has it no other one can.

I agree

In that case the if would always fail.

Yes, it would fail, but only for the current thread, which would continue to run. And subsequent threads should not fail if self.returncode is set correctly.

If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.

IMO the notification was already sent by the last call to the self._set_returncode method. Even though I have my doubts about whether a new notification is really necessary - when self._exit_blockers is zero? - the new if is no longer a relevant option.

2. Thread 3 does a non-blocking poll which "wins" and gets the status code (but has not yet set it!)

Regarding my first comment, I guess you are talking about a blocking poll.

When I ran your test_racing_joins test with the new if, all seemed okay: I never noticed a problem with missing notifications. We can check together if you wish.

Your fix seems okay to me. I was just wondering if it was possible to avoid executing the second protected block entirely.
Thank you for your time.

Contributor Author

If the thread did not win the race then the winner might have set it, but in that case it isn't enough to just exit immediately. We need to notify in case there was another blocked thread which didn't see the returncode and is waiting.

IMO the notification was already sent by the last call to the self._set_returncode method. Even though I have my doubts about whether a new notification is really necessary - when self._exit_blockers is zero? - the new if is no longer a relevant option.

You make a good point, and I made a mistake in my analysis above. The early exit you suggest is indeed safe if another racing poll has won and set the return code, since they will have also notified. The problematic case is the pathological one, where some other code entirely called os.waitpid and won the race. In that case _set_returncode is not called and hence we need that backup notification to avoid a hang.

In that scenario the Popen object is broken, since it has no way to get the child's exit code. Methods like is_alive() and mp mechanisms downstream of Popen will give the wrong result (there are a bunch of open issues that boil down to this). It is a "user error" situation, but we should still try and make everything fail as gracefully as possible (i.e. not hang 😉).

So, we could put the early exit in. Ultimately I think the code will behave correctly with or without it. Personally, I think the code reads slightly simpler without it, just in basic terms of fewer conditionals/less cyclomatic complexity, but YMMV on that. If there is a general consensus otherwise I'll happily add it.

When I ran your test_racing_joins test with the new if, all seems okay, I never noticed problem with missing notifications. We can check together if you wish.

Thanks, I would love to collaborate on this! The existing test case does not test the pathological behaviour, so it will indeed miss this. Here is a new test script that does; it hangs reliably for me if the last blocker notification is removed:

import multiprocessing as mp
import os
import threading

N = 8

def wait(barrier, p):
    barrier.wait()
    p.join()

    # WRONG due to losing the child's status
    assert p.exitcode is None
    assert p.is_alive()

def race():
    pbarrier = mp.Barrier(2)

    # Ensure child has completed
    p = mp.Process(target=pbarrier.wait)
    p.start()
    pbarrier.wait()

    # Steal the child's exit code
    pid, sts = os.waitpid(p.pid, 0)
    assert pid == p.pid
    assert sts == 0

    # Since the child is dead, these should not block...
    tbarrier = threading.Barrier(N)
    threads = [threading.Thread(target=wait, args=(tbarrier, p)) for _ in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    mp.set_start_method('fork')
    race()

Does that reproduce for you? I guess it should be tidied up and turned into a unit test...

Your fix seems okay to me. I was just wondering if it was possible to avoid executing the second protected block entirely. Thank you for your time.

Thanks! For the record, I very much appreciate your review, and am grateful for your feedback and intelligent questions. I really dislike the sort of overly-complex synchronisation I'm doing here, so I'm particularly grateful for it in this case. I would love to simplify this!


def __init__(self, process_obj):
    self._fds = []
    self._lock = threading.Lock()
Contributor

@YvesDup YvesDup Nov 28, 2025

What do we do here about the private _exit_condition and _exit_blockers attributes inherited from the popen_fork.Popen class? I suggest a private method in the parent class which defines its own synchronization attributes, called from __init__.

Contributor Author

Good question. I had intended to just ignore them. We are very fortunate the locking is entirely self-contained within each class's poll implementation, so this works just fine, but it does leave the child with unused instance variables inherited from the parent.

If I understand your suggestion (please correct me if not), you are suggesting a polymorphic method to initialise locking, so each class only creates the attributes it requires. I agree, that sounds like an improvement, thanks! I'll update the patch shortly to add it.

BTW, I think all this would be simpler and more robust if popen_forkserver.Popen did not directly inherit from popen_fork.Popen. They should probably both inherit from an abstract base class, instead. Presumably it would be impractical, if not impossible, to change that now due to backward compat concerns, though.

…o each class can do so as it requires.

Co-authored-by: Duprat <yduprat@gmail.com>