Conversation

@markshannon
Member

@markshannon markshannon commented Jun 13, 2025

The stats need fixing and the generated tables could be more compact, but it works.

Member

@Fidget-Spinner Fidget-Spinner left a comment


This is really cool. I'll do a full review soon enough.

@markshannon
Member Author

Performance is in the noise; we would need a really big speedup of jitted code for it to be more than noise overall.

The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

@Fidget-Spinner
Member

The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup. I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.

@markshannon markshannon marked this pull request as ready for review June 20, 2025 15:04
Member

@Fidget-Spinner Fidget-Spinner left a comment


I need to review the cases generator later.

Contributor

@colesbury colesbury left a comment


One comment below. The _LOAD_ATTR changes look fine to me.

@markshannon
Member Author

Latest benchmarking results: https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250813-3.15.0a0-6a85f95-JIT

2.6% faster on Windows. No change on Linux.

It looks like coverage is slower on Linux, which is presumably some sort of artifact, as the coverage benchmark does lots of instrumentation, which prevents the JIT from running. (Plus, the coverage benchmark uses an old version of coverage; the latest version is much faster.)

@Fidget-Spinner
Member

The tail-calling CI seems to be failing because Homebrew changed where they install clang (yet again). Will put up a separate PR to fix that.

@Fidget-Spinner
Member

Ok, I fixed the macOS CI on main. Please pull the changes in.

@markshannon
Member Author

I thought that caching through side exits would speed things up, but it looks like it slows things down a bit if anything.
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250822-3.15.0a0-efe4628-JIT

So, I've reverted that change.

Will rerun the benchmarks, to confirm...

@markshannon
Member Author

I was hoping to get a clear across-the-board speedup before merging this.
But as it also enables decref removal and fixes the problem of intermediate values on the stack during tracing (#140277), I think we should merge this soon and tweak it later.

Member

@savannahostrowski savannahostrowski left a comment


Took a quick skim of this - very neat! Thanks for also including all the perf numbers in the discussion, which helps counteract some of the initial trepidation I had around the amount of new generated code. This lays pretty solid groundwork for future optimizations as well.

Just one comment about the type change in pycore_optimizer.h

  uint8_t chain_depth:6;  // Must be big enough for MAX_CHAIN_DEPTH - 1.
  bool warm;
- int index;      // Index of ENTER_EXECUTOR (if code isn't NULL, below).
+ int16_t index;  // Index of ENTER_EXECUTOR (if code isn't NULL, below).
Member

@savannahostrowski savannahostrowski Nov 3, 2025


Was this intentional? In practice, functions are very small and should probably never exceed this, but do we want a check here or some validation? I think this could overflow silently, right?


# Simple heuristic for size to avoid too much stencil duplication
def is_large(uop: Uop) -> bool:
    return len(list(uop.body.tokens())) > 80
Member

@Fidget-Spinner Fidget-Spinner Nov 9, 2025


This is too low to be useful. For example, _BINARY_OP_ADD_FLOAT is 89 tokens. It should be register allocated, but instead it spills all the time because it's limited to r2_1. I've inspected the code for nbody and it spills unnecessarily. We should raise the limit.

I suggest increasing the limit to 150.

Member

@Fidget-Spinner Fidget-Spinner Nov 9, 2025


Some benchmark results for the nbody tracing JIT vs the nbody tracing JIT + regalloc after increasing the limit. Base commits are a few apart, but they have no JIT-related changes in them:

Mean +- std dev: [nbody_tracing] 107 ms +- 2 ms -> [nbody_tracing_regalloc] 78.8 ms +- 0.6 ms: 1.36x faster

36% faster. That's significantly higher than the 17-27% we've been getting.

@bedevere-app

bedevere-app bot commented Nov 9, 2025

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@Fidget-Spinner
Member

Once we get Savannah's LLVM 21 PR in, we should experiment with setting the TOS cache size to 4. I observe a lot of spilling due to the loop iterator taking up some space.

@markshannon
Member Author

markshannon commented Nov 12, 2025

I think we will want to vary the cache depending on both hardware and operating system.
We will want to vary it for different hardware, due to differing numbers of available registers.
We will want to vary it for the operating system, as different OSes have different calling conventions, which changes the number of available registers.

All that for later PR(s) though.

7 participants