GH-135379: Top of stack caching for the JIT. #135465
Conversation
Fidget-Spinner left a comment
This is really cool. I'll do a full review soon enough.
Performance is in the noise overall; we would need a really big speedup of jitted code for it to be more than noise. The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-20-16-03-59.gh-issue-135379.eDg89T.rst (resolved)
Fidget-Spinner left a comment
I need to review the cases generator later.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-13-13-32-16.gh-issue-135379.pAxZgy.rst (resolved)
colesbury left a comment
One comment below. The _LOAD_ATTR changes look fine to me.
Latest benchmarking results: https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250813-3.15.0a0-6a85f95-JIT 2.6% faster on Windows. No change on Linux. It looks like coverage is slower on Linux, which is presumably some sort of artifact, as the coverage benchmark does lots of instrumentation, which prevents the JIT from running. (Plus, the coverage benchmark uses an old version of coverage; the latest version is much faster.)
The tail calling CI seems to be failing because Homebrew changed where it installs clang (yet again). Will put up a separate PR to fix that.
Ok, I fixed the macOS CI on main. Please pull the changes in.
I thought that caching through side exits would speed things up, but it looks like it slows things down a bit, if anything. So I've reverted that change. Will rerun the benchmarks to confirm.
I was hoping to get a clear across-the-board speedup before merging this.
savannahostrowski left a comment
Took a quick skim of this - very neat! Thanks for also including all the perf numbers in the discussion, which helps counteract some of the initial trepidation I had around the amount of new generated code. This lays pretty solid groundwork for future optimizations as well.
Just one comment about the type change in pycore_optimizer.h
    uint8_t chain_depth:6;  // Must be big enough for MAX_CHAIN_DEPTH - 1.
    bool warm;
   -int index;              // Index of ENTER_EXECUTOR (if code isn't NULL, below).
   +int16_t index;          // Index of ENTER_EXECUTOR (if code isn't NULL, below).
Was this intentional? In practice, functions are very small and should probably never exceed this, but do we want a check here or some validation? I think this could overflow silently, right?
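As an illustrative sketch (not CPython code), the silent wraparound being worried about here can be demonstrated with ctypes, which reproduces C's narrowing-assignment behavior; the `checked_int16` helper is a hypothetical example of the kind of validation being asked for:

```python
import ctypes

# C's `int16_t index = 40000;` silently truncates modulo 2**16;
# ctypes.c_int16 mimics that behavior in Python.
wrapped = ctypes.c_int16(40000).value
print(wrapped)  # -25536 (40000 - 65536), with no error raised

# A defensive check before narrowing would fail loudly instead
# of wrapping (hypothetical helper, for illustration only):
def checked_int16(value: int) -> int:
    if not -2**15 <= value < 2**15:
        raise OverflowError(f"{value} does not fit in int16_t")
    return value
```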
    # Simple heuristic for size to avoid too much stencil duplication
    def is_large(uop: Uop) -> bool:
        return len(list(uop.body.tokens())) > 80
This is too low to be useful. For example, _BINARY_OP_ADD_FLOAT is 89 tokens. It should be register allocated, but instead it spills all the time because it's limited to r2_1. I've inspected the code for nbody and it spills unnecessarily. We should raise the limit.
I suggest increasing the limit to 150.
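With the limit lifted as suggested, the heuristic might look like the sketch below. The `Uop`/`Body` classes here are minimal stand-ins for the real cases-generator types (which live in Tools/cases_generator), included only so the snippet is self-contained:

```python
from typing import Iterator, Sequence


class Body:
    """Stand-in for a uop body that can enumerate its tokens."""

    def __init__(self, tokens: Sequence[str]) -> None:
        self._tokens = list(tokens)

    def tokens(self) -> Iterator[str]:
        return iter(self._tokens)


class Uop:
    """Stand-in for the cases generator's Uop class."""

    def __init__(self, tokens: Sequence[str]) -> None:
        self.body = Body(tokens)


# Raised from 80 so that uops like _BINARY_OP_ADD_FLOAT
# (89 tokens) still get register-allocated variants.
MAX_UOP_TOKENS = 150


def is_large(uop: Uop) -> bool:
    # Simple heuristic for size to avoid too much stencil duplication.
    return len(list(uop.body.tokens())) > MAX_UOP_TOKENS


print(is_large(Uop(["tok"] * 89)))   # False: under the raised limit
print(is_large(Uop(["tok"] * 151)))  # True: still considered large
```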
Some benchmark results for the nbody tracing JIT vs. the nbody tracing JIT + regalloc after increasing the limit. The base commits are a few commits apart, but there are no JIT-related changes between them:
Mean +- std dev: [nbody_tracing] 107 ms +- 2 ms -> [nbody_tracing_regalloc] 78.8 ms +- 0.6 ms: 1.36x faster
36% faster. That's significantly higher than the 17-27% we've been getting.
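The 1.36x / 36% figures follow directly from the two reported means:

```python
base_ms = 107.0      # nbody_tracing mean
regalloc_ms = 78.8   # nbody_tracing_regalloc mean

# pyperf reports old/new as an "Nx faster" ratio.
speedup = base_ms / regalloc_ms
print(f"{speedup:.2f}x faster")        # 1.36x faster
print(f"{(speedup - 1) * 100:.0f}%")   # 36%
```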
Once we get Savannah's LLVM 21 PR in, we should experiment with setting the TOS cache size to 4. I observe a lot of spilling due to the loop iterator taking up some space.
I think we will want to vary the cache depending on both hardware and operating system. All that for later PR(s) though. |
The stats need fixing and the generated tables could be more compact, but it works.