
Conversation


@wsbagnsv1 wsbagnsv1 commented Dec 2, 2025

Description

This PR optimizes the SOLVE_TRI kernel, which is currently used by qwen3-next models.

Motivation

At the moment this kernel executes roughly once per layer per encoded token with qwen3-next. Although the end-to-end throughput impact is minor (<1%), JohannesGaessler indicated that these small optimizations are acceptable in principle, and who knows, it might be useful in the future.

Changes

  • Shared memory -> registers: Switched from shared memory to a register-based approach (x_low, x_high), eliminating bank conflicts and doubling theoretical occupancy (37.5% -> 75%).
  • Loop splitting: Split the main reduction loop into two stages (low/high) to remove conditional branching overhead from the first 32 rows.
  • ALU optimization: Replaced direct division with inverse multiplication + fmaf for the diagonal update, improving instruction pipelining. A schematic sketch of the combined pattern follows this list.
  • Cleanup: Removed the unused MAX_K_FAST definition.
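
The sketch below shows how these changes fit together. It is an illustrative reconstruction, not the merged kernel: it assumes a row-major n×n lower-triangular matrix A_batch with stride n, a right-hand side B_batch and solution X_batch of shape n×k with row stride k (matching the indexing visible in the review excerpt further down), and n <= 2*WARP_SIZE. The function name and X_batch are hypothetical.

```cuda
#define WARP_SIZE 32

// Illustrative only: two rows of the solution column live in registers per
// lane (x_low = row `lane`, x_high = row `WARP_SIZE + lane`), replacing the
// shared-memory tile of the old kernel.
__device__ void solve_tri_column_sketch(const float * A_batch, const float * B_batch,
                                        float * X_batch, const int n, const int k,
                                        const int col_idx, const int lane) {
    float x_low  = (lane < n)             ? B_batch[lane * k + col_idx]               : 0.0f;
    float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;

    // Stage 1: solve rows 0..31. Splitting the loop means this stage never
    // has to decide whether the pivot row lives in x_low or x_high.
    #pragma unroll
    for (int row = 0; row < WARP_SIZE; ++row) {
        if (lane == row && row < n) {
            x_low *= 1.0f / A_batch[row * n + row]; // reciprocal + multiply rather than a direct divide
        }
        const float x_row = __shfl_sync(0xFFFFFFFF, x_low, row); // broadcast the solved element
        if (lane > row && lane < n) {
            x_low = fmaf(-A_batch[lane * n + row], x_row, x_low); // fused multiply-add for the update
        }
        if (WARP_SIZE + lane < n) {
            x_high = fmaf(-A_batch[(WARP_SIZE + lane) * n + row], x_row, x_high);
        }
    }

    // Stage 2: solve rows 32..63; x_low is final, only x_high updates remain.
    #pragma unroll
    for (int row = 0; row < WARP_SIZE; ++row) {
        const int g_row = WARP_SIZE + row;
        if (lane == row && g_row < n) {
            x_high *= 1.0f / A_batch[g_row * n + g_row];
        }
        const float x_row = __shfl_sync(0xFFFFFFFF, x_high, row);
        if (lane > row && WARP_SIZE + lane < n) {
            x_high = fmaf(-A_batch[(WARP_SIZE + lane) * n + g_row], x_row, x_high);
        }
    }

    if (lane < n) {
        X_batch[lane * k + col_idx] = x_low;
    }
    if (WARP_SIZE + lane < n) {
        X_batch[(WARP_SIZE + lane) * k + col_idx] = x_high;
    }
}
```

In a production kernel the bounds checks could be hoisted or specialized per launch; the sketch keeps them inline for clarity.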

RTX 4070 Ti llama.cpp test-backend-ops

| Metric | Baseline (Master) | New kernel | Improvement |
| --- | --- | --- | --- |
| Duration (us/run) | 7.14 | 5.31 | -25.63% |
| Throughput (GB/s) | 20.29 | 27.29 | +34.50% |

RTX 2070 llama.cpp test-backend-ops

| Metric | Baseline (Master) | New kernel | Improvement |
| --- | --- | --- | --- |
| Duration (us/run) | 13.23 | 11.00 | -16.86% |
| Throughput (GB/s) | 10.96 | 13.18 | +20.26% |

NSIGHT performance comparison: Baseline (Master) vs. Optimized, RTX 2070 (50 runs)

| Metric | Baseline (Master) | Optimized | Improvement |
| --- | --- | --- | --- |
| Duration (us) | 14,350.080 | 11,251.840 | -21.59% |
| Executed Instructions | 76,848.000 | 65,952.000 | -14.18% |
| Theoretical Occupancy | 37.500% | 75.000% | +100.00% |
| Registers Per Thread | 25.000 | 23.000 | -8.00% |
| DRAM Throughput (bytes/s) | 11,922,005,042 | 14,885,891,856 | +24.86% |
| Eligible Warps Per Scheduler | 0.140 | 0.159 | +13.57% |
| Warp Cycles Per Instruction | 11.390 | 10.131 | -11.05% |
| Compute (SM) Throughput | 10.128 | 9.040 | -10.74% |
| L1/TEX Cache Throughput | 49.591 | 45.303 | -8.65% |
| L2 Hit Rate | 35.801% | 34.849% | -2.66% |
| Avg. Active Threads | 32.000 | 32.000 | 0.00% |
| Avg. Divergent Branches | 0.000 | 0.000 | 0.00% |

> Note: Negative percentages in Compute/L1 throughput are considered improvements in this context, as the previous kernel inflated these numbers with inefficient instructions.

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements, aligning with standard ggml tensor arithmetic (casting to `char *` before addition); see the sketch after this list.
- Remove the unused `MAX_K_FAST` definition.
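
For the stride change in particular, the pattern looks like this (a minimal sketch assuming float data; the helper name `row_ptr` and its parameters are illustrative, not from the PR):

```cuda
// ggml stores strides (tensor->nb[i]) in bytes, so pointer arithmetic goes
// through a char pointer before casting back to the element type.
static __device__ __forceinline__ const float * row_ptr(const void * base, const int64_t nb1, const int64_t i1) {
    return (const float *) ((const char *) base + i1 * nb1);
}
```

With a row stride of nb1 bytes, element (i1, i0) of a float tensor is then `row_ptr(src, nb1, i1)[i0]`; passing byte strides keeps the addressing correct for non-contiguous views.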
@CISC CISC requested a review from JohannesGaessler December 2, 2025 21:03
@github-actions github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 3, 2025
Comment on lines +63 to +64
float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
Collaborator

Suggested change:

  - float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
  - float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
  + const float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
  + const float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;

Please use const wherever applicable so that one can easily tell which variables are subject to change in the future.

Author

Both of those are changed in the #pragma unroll loops, so they can't be const (;
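
For context, a schematic of the point being made (an illustrative fragment, not the actual kernel body; it assumes the surrounding kernel's A_batch, B_batch, n, k, col_idx, and lane):

```cuda
// x_low is initialized once but then updated in place inside the unrolled
// reduction, so it cannot be declared const.
float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
#pragma unroll
for (int row = 0; row < WARP_SIZE; ++row) {
    const float x_row = __shfl_sync(0xFFFFFFFF, x_low, row); // the per-iteration broadcast can be const
    if (lane > row && lane < n) {
        x_low = fmaf(-A_batch[lane * n + row], x_row, x_low); // in-place mutation rules out const
    }
}
```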

Collaborator

Seems like you're right. And confusion like this is precisely why I want a clear and consistent distinction between const and non-const variables.

Author

Should I add a comment to clear things up?

Collaborator

A comment explaining the purposes of x_low and x_high would be nice to have but not required. The problem here was rather that I read the kernel top to bottom, wasn't sure whether these particular values are supposed to be constant, didn't see the part further down where they are modified (but saw that you are not consistently adding const where applicable), and then left this comment.

Author

I've added const to every variable that should use it now (;
