Skip to content

Conversation

@Fidget-Spinner
Copy link
Member

@Fidget-Spinner Fidget-Spinner commented Nov 15, 2025

This PR gives the JIT free-threading support. It is only on for single-threaded code, and turns off automatically on multi-threaded code. All JIT features are turned on, including the optimizer. All tests pass on my system, including TSAN as of 527aac1 usng the FT suppression file, except for the usual spurious race conditions already in existing CPython.

Benchmark results are good. Geomean of 2% faster on FT+JIT vs just FT. So the JIT provides a 2% geomean speedup on FT for pyperformance! Compare https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251116-3.15.0a1+-763ebea-JIT,NOGIL against https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251115-3.15.0a1+-ed73c90-NOGIL. For convenience, I downloaded the json files and pyperf compared them here https://gist.github.com/Fidget-Spinner/d79f2f050147fc355a300bf9d413a75e

Design:
Creation of >1 threads cause global invalidation of all executors and disables JIT.

In the future, we can do lock removal too. The performance figures from the lock removal are promising (no lock removal vs lock removal). All with FT+PGO+LTO=thin+TC+JIT + pyperf system tune:

Mean +- std dev: [nogil-float-lock] 69.9 ms +- 1.5 ms -> [nogil-float-nolock] 54.9 ms +- 0.8 ms: 1.27x faster
Mean +- std dev: [nogil-nbody-lock] 139 ms +- 1 ms -> [nogil-nbody-nolock] 115 ms +- 3 ms: 1.21x faster
Mean +- std dev: [nogil-richards-lock] 37.3 ms +- 0.4 ms -> [nogil-richards-nolock] 35.8 ms +- 0.3 ms: 1.04x faster 
Mean +- std dev: [nogil-deltablue-lock] 3.50 ms +- 0.15 ms -> [nogil-deltablue-nolock] 3.36 ms +- 0.14 ms: 1.04x faster

This is on a platform where locking/atomics are somewhat slow (i7-12700h). I removed the lock removal code for this PR to reduce the diff.

@Fidget-Spinner
Copy link
Member Author

Fidget-Spinner commented Nov 15, 2025

@sergey-miryanov in the future could you please bundle up your reviews and send them in one review instead of multiple reviews?

@corona10 corona10 self-requested a review November 16, 2025 04:20
PyInterpreterState *interp = _PyInterpreterState_GET();
if (!interp->jit) {
// gh-140936: It is possible that interp->jit will become false during
if (!FT_ATOMIC_LOAD_CHAR_RELAXED(_tstate->jit_executor_state.jit)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we need to use atomic operation in here? even for the per-thread state?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we need this. Because we set jit to false in the add_threadstate and we do it for all threads.

@markshannon
Copy link
Member

This PR has some serious issues:

  • Moving all the JIT state to the thread does not make things thread safe. It might do the opposite by introducing race conditions.
  • There is a possible use after free bug, when an executor created by one thread is run in another
  • It leaks an unbounded amount of memory, as it doesn't collect executors.
  • It waste memory allocating cold executors per thread

Executors are not "effectively immortal"; cold executors do get freed.

There are some useful improvements here, such as locking when inserting/removing executors from the linked list, but those should be in separate small PRs.

Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.

@Fidget-Spinner
Copy link
Member Author

Fidget-Spinner commented Nov 17, 2025

  • It leaks an unbounded amount of memory, as it doesn't collect executors.

This isn't true? The GC is now responsible for collecting executors.

Executors are not "effectively immortal"; cold executors do get freed.

Let me rephrase, they are effectively GC-only, not refcounted. As they always form reference cycles. So there's no point refcounting them. Just leave it to the GC.

Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.

Yeap that I agree on.

@Fidget-Spinner
Copy link
Member Author

@markshannon I reduced the diff to the minimal changeset just to get everything working. This is just +50 lines of code roughly now!

{
PyInterpreterState *interp = PyInterpreterState_Get();
_Py_Executors_InvalidateDependency(interp, obj, 1);
_Py_Executors_InvalidateDependency(_PyInterpreterState_GET(), obj, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be reverted too.

}

PyFunctionObject *
_PyFunction_LookupByVersion(uint32_t version, PyObject **p_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function really used somewhere?
Maybe we should remove it on main in a separate PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's strange, you're right, it's not used.

It's useful for the optimizer though, so we might want to use it in the future.

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have time for a detailed review ATM, but I've a few comments

// based on the version, so we do not need to stop the world to set it.
func->func_version = version;
#ifndef Py_GIL_DISABLED
#if _Py_TIER2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the tier 1 with-gil behavior.
I don't see why we would need the cache in tier 1, but can you double check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I double checked.

uint32_t
_PyFunction_GetVersionForCurrentState(PyFunctionObject *func)
{
// This function does not need locking/atomics as it can only be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the function version in specialized instructions, so it is used in tier 1.
I'm surprised this isn't already synchronized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use in the specializer is protected because it's accessing a stack variable which holds a strong reference to it. I will clarify that the unsnchronized use is only in the optimizer and it's fine there.

_PyCode_Clear_Executors(code);
}
_Py_Executors_InvalidateDependency(interp, code, 1);
_PyJit_Tracer_InvalidateDependency(PyThreadState_GET(), code);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged it into _Py_Executors_InvalidateDependency because it's actually always called when we invalidate executor dependency. So logically, it should be part of it.

PyObject *lookup = _PyType_Lookup(type, name);
if (lookup) {
int opcode = _Py_IsImmortal(lookup) ? immortal : mortal;
int opcode = _Py_IsImmortal(lookup) || _PyObject_HasDeferredRefcount(lookup) ? immortal : mortal;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens to be correct, because the uop we use for immortal objects tags the pointer.
But in theory, we could not tag the pointer and the code would be correct for immortal objects but not those with deferred refcounts.

However, it is always safe to use a tagged pointer for an immortal object.
So, can you rename immortal to deferred_count or similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants