
Conversation


@wsbagnsv1 wsbagnsv1 commented Dec 2, 2025

Description

This PR optimizes the SOLVE_TRI kernel, which is currently used by qwen3-next models.

Motivation

At the moment this kernel executes roughly once per layer per encoded token with qwen3-next. Although the end-to-end throughput impact is minor (<1%), JohannesGaessler indicated that these small optimizations are acceptable in principle, and who knows, it might be useful in the future.

Changes

  • Shared memory -> registers: Switched from shared memory to a register-based approach (x_low, x_high), eliminating bank conflicts and doubling theoretical occupancy (37.5% -> 75%).
  • Loop splitting: Split the main reduction loop into two stages (low/high) to remove conditional branching overhead from the first 32 rows.
  • ALU optimization: Replaced direct division with inverse multiplication + fmaf for the diagonal update, improving instruction pipelining. A schematic sketch of the combined pattern follows this list.
  • Cleanup: Removed the unused MAX_K_FAST definition.
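
The sketch below shows how these changes fit together. It is an illustrative reconstruction, not the merged kernel: it assumes a row-major n×n lower-triangular matrix A_batch with stride n, a right-hand side B_batch and solution X_batch of shape n×k with row stride k (matching the indexing visible in the review excerpt further down), and n <= 2*WARP_SIZE. The function name and X_batch are hypothetical.

```cuda
#define WARP_SIZE 32

// Illustrative only: two rows of the solution column live in registers per
// lane (x_low = row `lane`, x_high = row `WARP_SIZE + lane`), replacing the
// shared-memory tile of the old kernel.
__device__ void solve_tri_column_sketch(const float * A_batch, const float * B_batch,
                                        float * X_batch, const int n, const int k,
                                        const int col_idx, const int lane) {
    float x_low  = (lane < n)             ? B_batch[lane * k + col_idx]               : 0.0f;
    float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;

    // Stage 1: solve rows 0..31. Splitting the loop means this stage never
    // has to decide whether the pivot row lives in x_low or x_high.
    #pragma unroll
    for (int row = 0; row < WARP_SIZE; ++row) {
        if (lane == row && row < n) {
            x_low *= 1.0f / A_batch[row * n + row]; // reciprocal + multiply rather than a direct divide
        }
        const float x_row = __shfl_sync(0xFFFFFFFF, x_low, row); // broadcast the solved element
        if (lane > row && lane < n) {
            x_low = fmaf(-A_batch[lane * n + row], x_row, x_low); // fused multiply-add for the update
        }
        if (WARP_SIZE + lane < n) {
            x_high = fmaf(-A_batch[(WARP_SIZE + lane) * n + row], x_row, x_high);
        }
    }

    // Stage 2: solve rows 32..63; x_low is final, only x_high updates remain.
    #pragma unroll
    for (int row = 0; row < WARP_SIZE; ++row) {
        const int g_row = WARP_SIZE + row;
        if (lane == row && g_row < n) {
            x_high *= 1.0f / A_batch[g_row * n + g_row];
        }
        const float x_row = __shfl_sync(0xFFFFFFFF, x_high, row);
        if (lane > row && WARP_SIZE + lane < n) {
            x_high = fmaf(-A_batch[(WARP_SIZE + lane) * n + g_row], x_row, x_high);
        }
    }

    if (lane < n) {
        X_batch[lane * k + col_idx] = x_low;
    }
    if (WARP_SIZE + lane < n) {
        X_batch[(WARP_SIZE + lane) * k + col_idx] = x_high;
    }
}
```

In a production kernel the bounds checks could be hoisted or specialized per launch; the sketch keeps them inline for clarity.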

RTX 4070 Ti llama.cpp test-backend-ops

| Metric | Baseline (Master) | New kernel | Improvement |
| --- | --- | --- | --- |
| Duration (us/run) | 7.14 | 5.31 | -25.63% |
| Throughput (GB/s) | 20.29 | 27.29 | +34.50% |

RTX 2070 llama.cpp test-backend-ops

| Metric | Baseline (Master) | New kernel | Improvement |
| --- | --- | --- | --- |
| Duration (us/run) | 13.23 | 11.00 | -16.86% |
| Throughput (GB/s) | 10.96 | 13.18 | +20.26% |

NSIGHT performance comparison: Baseline (Master) vs. Optimized, RTX 2070 (50 runs)

| Metric | Baseline (Master) | Optimized | Improvement |
| --- | --- | --- | --- |
| Duration (us) | 14,350.080 | 11,251.840 | -21.59% |
| Executed Instructions | 76,848.000 | 65,952.000 | -14.18% |
| Theoretical Occupancy | 37.500% | 75.000% | +100.00% |
| Registers Per Thread | 25.000 | 23.000 | -8.00% |
| DRAM Throughput (bytes/s) | 11,922,005,042 | 14,885,891,856 | +24.86% |
| Eligible Warps Per Scheduler | 0.140 | 0.159 | +13.57% |
| Warp Cycles Per Instruction | 11.390 | 10.131 | -11.05% |
| Compute (SM) Throughput | 10.128 | 9.040 | -10.74% |
| L1/TEX Cache Throughput | 49.591 | 45.303 | -8.65% |
| L2 Hit Rate | 35.801% | 34.849% | -2.66% |
| Avg. Active Threads | 32.000 | 32.000 | 0.00% |
| Avg. Divergent Branches | 0.000 | 0.000 | 0.00% |

> Note: Negative percentages in Compute/L1 throughput are considered improvements in this context, as the previous kernel inflated these numbers with inefficient instructions.

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements, aligning with standard ggml tensor arithmetic (casting to `char *` before addition); see the sketch after this list.
- Remove the unused `MAX_K_FAST` definition.
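
For the stride change in particular, the pattern looks like this (a minimal sketch assuming float data; the helper name `row_ptr` and its parameters are illustrative, not from the PR):

```cuda
// ggml stores strides (tensor->nb[i]) in bytes, so pointer arithmetic goes
// through a char pointer before casting back to the element type.
static __device__ __forceinline__ const float * row_ptr(const void * base, const int64_t nb1, const int64_t i1) {
    return (const float *) ((const char *) base + i1 * nb1);
}
```

With a row stride of nb1 bytes, element (i1, i0) of a float tensor is then `row_ptr(src, nb1, i1)[i0]`; passing byte strides keeps the addressing correct for non-contiguous views.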
@CISC CISC requested a review from JohannesGaessler December 2, 2025 21:03
@github-actions github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 3, 2025
Comment on lines +63 to +64
float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
Collaborator

Suggested change:

  - float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
  - float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
  + const float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
  + const float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;

Please use const wherever applicable so that one can easily tell which variables are subject to change in the future.

Author

Both of those are changed in the #pragma unroll loops, so they can't be const (;
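
For context, a schematic of the point being made (an illustrative fragment, not the actual kernel body; it assumes the surrounding kernel's A_batch, B_batch, n, k, col_idx, and lane):

```cuda
// x_low is initialized once but then updated in place inside the unrolled
// reduction, so it cannot be declared const.
float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
#pragma unroll
for (int row = 0; row < WARP_SIZE; ++row) {
    const float x_row = __shfl_sync(0xFFFFFFFF, x_low, row); // the per-iteration broadcast can be const
    if (lane > row && lane < n) {
        x_low = fmaf(-A_batch[lane * n + row], x_row, x_low); // in-place mutation rules out const
    }
}
```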

Collaborator

Seems like you're right. And confusion like this is precisely why I want a clear and consistent distinction between const and non-const variables.

Author

Should I add a comment to clear things up?

Collaborator

A comment explaining the purposes of x_low and x_high would be nice to have but not required. The problem here was rather that I read the kernel top to bottom, wasn't sure whether these particular values are supposed to be constant, didn't see the part further down where they are modified (but saw that you are not consistently adding const where applicable), and then left this comment.

Author

I've added const to every variable that should use it now (;
