gh-141594: A free-threaded JIT #141595

Fidget-Spinner · 2025-11-15T16:28:35Z

This PR gives the JIT free-threading support. The JIT is enabled only for single-threaded code and turns off automatically once the code becomes multi-threaded. All JIT features are turned on, including the optimizer. All tests pass on my system, including TSAN as of 527aac1 using the FT suppression file, except for the usual spurious race conditions already present in existing CPython.

Benchmark results are good: on pyperformance, FT+JIT is 2% faster (geomean) than plain FT. Compare https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251116-3.15.0a1+-763ebea-JIT,NOGIL against https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251115-3.15.0a1+-ed73c90-NOGIL. For convenience, I downloaded the JSON files and compared them with pyperf here: https://gist.github.com/Fidget-Spinner/d79f2f050147fc355a300bf9d413a75e

Design:
Creating more than one thread causes global invalidation of all executors and disables the JIT.
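
The design rule can be modelled in a few lines. This is a toy Python model and not CPython's implementation: `ToyInterpreter`, `compile_trace`, and `add_thread` are hypothetical names, chosen only to illustrate "a second thread invalidates every executor and disables the JIT". The PR description does not say whether the JIT is re-enabled once the extra threads exit, so the model simply leaves it off.

```python
class ToyInterpreter:
    """Toy model of the design rule; every name here is hypothetical."""

    def __init__(self):
        self.thread_count = 1   # the main thread
        self.jit_enabled = True
        self.executors = []     # stand-ins for compiled traces

    def compile_trace(self, name):
        """Create an executor only while the JIT is still enabled."""
        if not self.jit_enabled:
            return False
        self.executors.append(name)
        return True

    def add_thread(self):
        """A second thread invalidates all executors and disables the JIT."""
        self.thread_count += 1
        if self.thread_count > 1:
            self.executors.clear()  # global invalidation
            self.jit_enabled = False


interp = ToyInterpreter()
interp.compile_trace("hot_loop")    # accepted: still single-threaded
interp.add_thread()                 # a second thread appears
interp.compile_trace("other_loop")  # rejected: the JIT is now off
```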

In the future, we can do lock removal too. The performance figures from the lock removal are promising (no lock removal vs lock removal). All with FT+PGO+LTO=thin+TC+JIT + pyperf system tune:

Mean +- std dev: [nogil-float-lock] 69.9 ms +- 1.5 ms -> [nogil-float-nolock] 54.9 ms +- 0.8 ms: 1.27x faster
Mean +- std dev: [nogil-nbody-lock] 139 ms +- 1 ms -> [nogil-nbody-nolock] 115 ms +- 3 ms: 1.21x faster
Mean +- std dev: [nogil-richards-lock] 37.3 ms +- 0.4 ms -> [nogil-richards-nolock] 35.8 ms +- 0.3 ms: 1.04x faster 
Mean +- std dev: [nogil-deltablue-lock] 3.50 ms +- 0.15 ms -> [nogil-deltablue-nolock] 3.36 ms +- 0.14 ms: 1.04x faster

This is on a platform where locking/atomics are somewhat slow (i7-12700h). I removed the lock removal code for this PR to reduce the diff.
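
As a quick sanity check on the figures above, the "x faster" factor is the mean time with locks divided by the mean time without locks. The sketch below recomputes it from the reported means; the dictionary layout and the `speedup` helper are illustrative assumptions, not part of pyperf.

```python
# Mean timings in milliseconds, copied from the pyperf output above:
# (with locks, without locks).
timings_ms = {
    "float": (69.9, 54.9),
    "nbody": (139.0, 115.0),
    "richards": (37.3, 35.8),
    "deltablue": (3.50, 3.36),
}


def speedup(before_ms, after_ms):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_ms / after_ms


speedups = {name: speedup(*pair) for name, pair in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster")
```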

Issue: JIT x Free-Threading #141594

Lib/test/test_capi/test_opt.py

Objects/listobject.c

Python/bytecodes.c

Python/pylifecycle.c

Python/sysmodule.c

Fidget-Spinner · 2025-11-15T20:04:46Z

@sergey-miryanov in the future could you please bundle up your reviews and send them in one review instead of multiple reviews?

Misc/NEWS.d/next/Core_and_Builtins/2025-11-15-16-30-46.gh-issue-141594.PSsC5J.rst

Python/pylifecycle.c

corona10 · 2025-11-16T04:30:43Z

Python/optimizer.c

-    PyInterpreterState *interp = _PyInterpreterState_GET();
-    if (!interp->jit) {
-        // gh-140936: It is possible that interp->jit will become false during
+    if (!FT_ATOMIC_LOAD_CHAR_RELAXED(_tstate->jit_executor_state.jit)) {


nit: Do we need an atomic operation here, even for per-thread state?

IIUC, we need this, because add_threadstate sets jit to false, and it does so for all threads.

markshannon · 2025-11-17T11:52:14Z

This PR has some serious issues:

Moving all the JIT state to the thread does not make things thread safe. It might do the opposite by introducing race conditions.
There is a possible use-after-free bug when an executor created by one thread is run in another.
It leaks an unbounded amount of memory, as it doesn't collect executors.
It wastes memory by allocating cold executors per thread.

Executors are not "effectively immortal"; cold executors do get freed.

There are some useful improvements here, such as locking when inserting/removing executors from the linked list, but those should be in separate small PRs.

Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.

Fidget-Spinner · 2025-11-17T11:56:45Z

> It leaks an unbounded amount of memory, as it doesn't collect executors.

This isn't true? The GC is now responsible for collecting executors.

> Executors are not "effectively immortal"; cold executors do get freed.

Let me rephrase: they are effectively GC-only, not refcounted. They always form reference cycles, so there's no point refcounting them. Just leave it to the GC.
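
The general point about cycles can be seen at the Python level. The sketch below does not touch executors at all: `Node` is a hypothetical stand-in, used only to show that objects in a reference cycle are not freed when the last outside reference goes away, and are reclaimed only when the cycle collector runs.

```python
import gc
import weakref


class Node:
    """Plain object that can point at another Node."""

    def __init__(self):
        self.other = None


gc.disable()             # keep the automatic collector out of the way for the demo

a, b = Node(), Node()
a.other, b.other = b, a  # a <-> b now form a reference cycle
probe = weakref.ref(a)   # lets us observe whether `a` is still alive

del a, b                 # drop our references; the cycle keeps both objects alive
alive_after_del = probe() is not None

gc.collect()             # the cycle collector frees the unreachable cycle
alive_after_gc = probe() is not None

gc.enable()
```

Here `alive_after_del` is true and `alive_after_gc` is false: reference counts alone never reach zero inside the cycle, so only the GC can reclaim it.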

> Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.

Yep, that I agree on.

Fidget-Spinner · 2025-11-17T12:51:25Z

@markshannon I reduced the diff to the minimal changeset needed to get everything working. It is now roughly +50 lines of code!

efimov-mikhail · 2025-11-17T12:56:43Z

Modules/_testinternalcapi.c

 {
-    PyInterpreterState *interp = PyInterpreterState_Get();
-    _Py_Executors_InvalidateDependency(interp, obj, 1);
+    _Py_Executors_InvalidateDependency(_PyInterpreterState_GET(), obj, 1);


This can be reverted too.

Objects/codeobject.c

efimov-mikhail · 2025-11-17T13:23:22Z

Objects/funcobject.c

 }

 PyFunctionObject *
 _PyFunction_LookupByVersion(uint32_t version, PyObject **p_code)


Is this function really used somewhere?
Maybe we should remove it on main in a separate PR?

That's strange, you're right, it's not used.

It's useful for the optimizer though, so we might want to use it in the future.

Objects/typeobject.c

markshannon

I don't have time for a detailed review ATM, but I have a few comments.

markshannon · 2025-11-17T13:08:10Z

Objects/funcobject.c

    // based on the version, so we do not need to stop the world to set it.
    func->func_version = version;
-#ifndef Py_GIL_DISABLED
+#if _Py_TIER2


This changes the tier 1 with-gil behavior.
I don't see why we would need the cache in tier 1, but can you double check?

Yes I double checked.

markshannon · 2025-11-17T13:12:31Z

Objects/funcobject.c

 uint32_t
 _PyFunction_GetVersionForCurrentState(PyFunctionObject *func)
 {
+    // This function does not need locking/atomics as it can only be


We use the function version in specialized instructions, so it is used in tier 1.
I'm surprised this isn't already synchronized.

The use in the specializer is protected because it accesses a stack variable that holds a strong reference to the function. I will clarify that the unsynchronized use is only in the optimizer, and that it's fine there.

markshannon · 2025-11-17T13:19:19Z

Python/instrumentation.c

        _PyCode_Clear_Executors(code);
    }
    _Py_Executors_InvalidateDependency(interp, code, 1);
-    _PyJit_Tracer_InvalidateDependency(PyThreadState_GET(), code);


Why is this removed?

I merged it into _Py_Executors_InvalidateDependency because it is always called whenever we invalidate an executor dependency. So logically, it should be part of it.

markshannon · 2025-11-17T13:34:21Z

Python/optimizer_analysis.c

        PyObject *lookup = _PyType_Lookup(type, name);
        if (lookup) {
-            int opcode = _Py_IsImmortal(lookup) ? immortal : mortal;
+            int opcode = _Py_IsImmortal(lookup) || _PyObject_HasDeferredRefcount(lookup) ? immortal : mortal;


This happens to be correct, because the uop we use for immortal objects tags the pointer.
In theory, though, if we did not tag the pointer, the code would still be correct for immortal objects but not for those with deferred refcounts.

However, it is always safe to use a tagged pointer for an immortal object.
So, can you rename immortal to deferred_count or similar?

Fidget-Spinner added 15 commits November 13, 2025 21:17

Move all JIT fields to thread state

6a3bd75

fix a bug with traversing states

ba9a65a

fix JIT invalidation mechanism for FT

725894d

fix re-entrant finalizers

1e5713d

Re-enable the JIT

f0d4c57

cleanup a little

67de7d6

fix weird GC bugs

7f0bc57

re-enable jit on some stuff

f05e61c

fix test, more locks!

f78e8c8

fix JIT builds

e1f1b30

only clear at end

53c5e1d

remove locks in JIT code

cc38ee4

fix a few bugs

f8fefb3

Improve tracer thread safety

d76a24b

set immortal before GC

fa99108

Fidget-Spinner requested review from ZeroIntensity, ericsnowcurrently and markshannon as code owners November 15, 2025 16:28

bedevere-app bot added the awaiting core review label Nov 15, 2025

blurb-it bot and others added 2 commits November 15, 2025 16:30

📜🤖 Added by blurb_it.

fcfed96

fix default builkd

ba67ab7