cuda: optimize SOLVE_TRI using registers and FMAF #17703
base: master
Conversation
- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
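To illustrate the byte-stride convention mentioned in the last bullet, here is a minimal sketch; the helper name `batch_ptr` and the parameters `nb2`/`nb3` are illustrative assumptions, not the PR's actual code:

```cpp
// Illustrative sketch only: ggml tensor strides (nb*) are expressed in bytes, so offsets
// are added to the base pointer after casting to char *, then cast back to float *.
static __device__ const float * batch_ptr(const float * base, const int i2, const int i3,
                                           const size_t nb2, const size_t nb3) {
    return (const float *) ((const char *) base + (size_t) i3*nb3 + (size_t) i2*nb2);
}
```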
The lines in question:

    float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
    float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
Suggested change:

    -float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
    -float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
    +const float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
    +const float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
Please use const wherever applicable so that one can easily tell which variables are subject to change in the future.
Both of those are changed in the #pragma unroll loops, so they can't be const (;
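For context, a rough sketch of the pattern being discussed — the matrix layout, guards, and function name below are assumptions for illustration, not the PR's exact kernel code — showing how `x_low` is rewritten inside the unrolled substitution loop and therefore cannot be `const`:

```cpp
// Illustrative warp-level forward substitution for one column of a lower-triangular solve.
// Assumes a row-major n x n matrix A and an n x k right-hand side B with n <= WARP_SIZE
// (WARP_SIZE being ggml-cuda's warp size constant, 32); names, layout, and guards are
// assumptions for this sketch, not the PR's exact code.
static __device__ void solve_tri_column_sketch(const float * A, const float * B, float * X,
                                               const int n, const int k, const int col_idx) {
    const int lane = threadIdx.x % WARP_SIZE;
    float x_low = (lane < n) ? B[lane * k + col_idx] : 0.0f; // mutated below -> cannot be const

#pragma unroll
    for (int row = 0; row < WARP_SIZE; ++row) {
        if (lane == row && row < n) {
            x_low /= A[row * n + row];                            // finalize this row's solution
        }
        const float x_row = __shfl_sync(0xFFFFFFFF, x_low, row); // broadcast the solved element
        if (row < lane && lane < n) {
            x_low = fmaf(-A[lane * n + row], x_row, x_low);       // fold it into remaining rows
        }
    }

    if (lane < n) {
        X[lane * k + col_idx] = x_low;
    }
}
```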
Seems like you're right. And confusion like this is precisely why I want a clear and consistent distinction between const and non-const variables.
Should I add a comment to clear things up?
A comment explaining the purposes of x_low and x_high would be nice to have but not required. The problem here was rather that I read the kernel top to bottom, wasn't sure whether these particular values are supposed to be constant, didn't see the part further down where they are modified (but saw that you are not consistently adding const where applicable), and then left this comment.
I've added const to every variable that should use it now (;
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Description
This PR optimizes the `SOLVE_TRI` kernel, which is currently utilized in `qwen3-next` models.

Motivation

At the moment this kernel executes roughly once per layer per encoded token with `qwen3-next`. Although the end-to-end throughput impact is minor (<1%), JohannesGaessler indicated that these small optimizations are in principle acceptable, and they may prove useful in the future.

Changes

- Switched the RHS/solution matrix from shared memory to registers (`x_low`, `x_high`), eliminating bank conflicts and doubling theoretical occupancy (37.5% -> 75%).
- Used explicit `fmaf` for the diagonal update, improving instruction pipelining.
- Removed the unused `MAX_K_FAST` definition.

RTX 4070 Ti llama.cpp test-backend-ops
RTX 2070 llama.cpp test-backend-ops
Baseline (Master) vs. Optimized NSIGHT Performance Comparison RTX 2070 (50 runs)
> Note: Negative percentages in Compute/L1 throughput are considered improvements in this context, as the previous kernel inflated these numbers with inefficient instructions.