SOLVE_TRI CUDA kernel for small matrices #17457
Conversation
|
Hey, I've created a small OpenEvolve framework for this kernel and ran it for about 40 iterations (I plan to do around 600), and I already got around an 8% improvement with the kernel below. I'm pretty sure I've covered all the test cases, but please check it for correctness in case I missed something. Anyway, here is the performance improvement on my old RTX 2070 for this kernel: Oh, and this was reproducible across multiple runs over time and showed a consistent improvement (;
This is the kernel:
#include <cuda_fp16.h>
#define MAX_N_FAST 64
#define MAX_K_FAST 32
#define WARP_SIZE 32
// Warp reduction helper; reduces over the lanes in the active mask
static __inline__ __device__ float warpReduceSum(float val) {
    // Mask of the threads currently participating in the reduction
    unsigned mask = __activemask();
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(mask, val, offset);
    }
    return val;
}
// Optimized kernel focusing on coalesced access and warp-level parallelism
extern "C" __global__ void solve_tri_f32_fast(
        const float* __restrict__ A,
        const float* __restrict__ B,
        float* __restrict__ X,
        int n, int k,
        int64_t ne02, int64_t ne03,
        size_t nb02, size_t nb03,
        size_t nb12, size_t nb13,
        size_t nb2, size_t nb3)
{
    const int batch_idx = blockIdx.x;
    const int lane      = threadIdx.x;
    const int col_idx   = threadIdx.y;
    const int tid       = threadIdx.x + threadIdx.y * blockDim.x;
    // Early exit for excess warps
    if (col_idx >= k) {
        return;
    }
    // Calculate batch indices
    const int64_t i03 = batch_idx / ne02;
    const int64_t i02 = batch_idx % ne02;
    // Get pointers for this batch
    const float* const A_batch = (const float*)((const char *)A + i02 * nb02 + i03 * nb03);
    const float* const B_batch = (const float*)((const char *)B + i02 * nb12 + i03 * nb13);
    float* X_batch = (float*) ((char *)X + i02 * nb2 + i03 * nb3);
    // Shared memory for the A matrix and the B/X matrix (solved in place)
    __shared__ float sA[MAX_N_FAST * MAX_N_FAST];
    __shared__ float sX[MAX_N_FAST * MAX_K_FAST];
    // Coalesced loading of the A matrix:
    // each thread loads multiple elements to improve bandwidth utilization
    const int total_elements_A = n * n;
    const int stride_A = blockDim.x * blockDim.y;
    for (int i = tid; i < total_elements_A; i += stride_A) {
        sA[i] = A_batch[i];
    }
    // Coalesced loading of the B matrix
    const int total_elements_B = n * k;
    const int stride_B = blockDim.x * blockDim.y;
    for (int i = tid; i < total_elements_B; i += stride_B) {
        sX[i] = B_batch[i];
    }
    __syncthreads();
    // Forward substitution with warp-level parallelism:
    // each warp processes one column of the solution
    for (int row = 0; row < n; ++row) {
        float sum = 0.0f;
        // Register accumulation of the partial dot product
        float sum_part = 0.0f;
        // Parallel computation of the dot product:
        // each thread in the warp processes a subset of elements
        for (int j = lane; j < row; j += WARP_SIZE) {
            // Row of A times the already-solved part of this column of X
            sum_part += sA[row * n + j] * sX[j * k + col_idx];
        }
        // Warp-level reduction to get the final sum
        sum = warpReduceSum(sum_part);
        // Lane 0 performs the final computation and stores the result
        if (lane == 0) {
            const float b_val = sX[row * k + col_idx];
            const float a_diag = sA[row * n + row];
            // Exact comparison: only guard against a true zero diagonal
            if (a_diag != 0.0f) {
                sX[row * k + col_idx] = (b_val - sum) / a_diag;
            } else {
                sX[row * k + col_idx] = 0.0f; // only catch true division by zero
            }
        }
        // Synchronize so all warps see the updated row before the next iteration
        __syncthreads();
    }
    // Coalesced write back to global memory
    for (int i = tid; i < total_elements_B; i += stride_B) {
        X_batch[i] = sX[i];
    }
}
For those curious, OpenEvolve implements the AlphaEvolve approach from Google (or at least it started as that): using LLMs to iteratively evolve and optimize algorithms. I'll also upload the framework to my GitHub tomorrow for anyone interested (;
|
@am17an something like this? |
|
@pwilkin not quite, something like this: https://github.com/pwilkin/llama.cpp/compare/solve_tri_cuda...am17an:llama.cpp:solve_tri_cuda_opt?expand=1 With this I get:
|
One other thing: you don't need an entirely separate function for the general case; you can pass 0 as the template parameter and do an if constexpr on it.
ggml/src/ggml-cuda/solve_tri.cu
Outdated
__shared__ float sX[MAX_N_FAST * MAX_K_FAST];

// Load A into shared memory (coalesced)
#pragma unroll
These cannot be unrolled, so you can remove the #pragma.
#pragma unroll
I think this whole function can go away, it should be something like
if constexpr (n == 0) {
    // take this path
} else {
#pragma unroll
    // the fast loop
}
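For reference, a minimal sketch of that shape, assuming the compile-time size comes in as a template parameter (the name n_template, the signature, and the dispatch below are illustrative assumptions, not the actual PR code):

template <int n_template>
static __global__ void solve_tri_f32(const float * __restrict__ A,
                                     const float * __restrict__ B,
                                     float       * __restrict__ X,
                                     int n, int k) {
    if constexpr (n_template == 0) {
        // Generic path: loop bounds only known at runtime, so no unrolling.
        for (int row = 0; row < n; ++row) {
            // ... forward substitution with runtime bounds ...
        }
    } else {
        // Fast path: n is a compile-time constant, the loop can be unrolled.
#pragma unroll
        for (int row = 0; row < n_template; ++row) {
            // ... same forward substitution, fully unrolled ...
        }
    }
}

// Host-side dispatch (illustrative):
//   if (n == 64) solve_tri_f32<64><<<grid, block, 0, stream>>>(A, B, X, n, k);
//   else         solve_tri_f32<0> <<<grid, block, 0, stream>>>(A, B, X, n, k);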
@am17an I'm not very experienced with this, but I believe this is what you had in mind?
am17an
left a comment
Yeah, looks good. You can run it through clang-format once. Also, let's get @JohannesGaessler to take a look as well; he usually has the best ideas regarding performance.
|
For some reason, when testing with test-backend-ops, this case fails: logs
edit: I would appreciate it if anyone with an sm89 GPU could reproduce this
Can confirm. My 3080 (which is my CUDA0) works correctly, but my 5060 fails with the pattern you described. @am17an any ideas? Looks like some race condition, but why only on 89+ arch cards? |
|
Never mind; I needed to add a guard. Apparently the 30x0 didn't mind the missing one.
ggml/src/ggml-cuda/solve_tri.cu
Outdated
const float * const A_batch = (const float *) ((const char *) A + i02 * nb02 + i03 * nb03);
const float * const B_batch = (const float *) ((const char *) B + i02 * nb12 + i03 * nb13);
float * X_batch = (float *) ((char *) X + i02 * nb2 + i03 * nb3);
Generally speaking it is preferable to pass the strides in units of float instead of char.
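For illustration, a rough sketch of that change, under the assumption that the host divides the ggml byte strides by sizeof(float) before the launch (the helper name below is made up, not PR code):

// Host side (illustrative): convert the byte strides to element strides;
// for f32 tensors every stride is a multiple of sizeof(float).
//   const int64_t s02 = nb02 / sizeof(float);
//   const int64_t s03 = nb03 / sizeof(float);

// Device side: plain pointer arithmetic, no casts through char*.
static __device__ const float * batch_ptr(const float * base, int64_t i02, int64_t i03,
                                          int64_t s02, int64_t s03) {
    return base + i02 * s02 + i03 * s03;
}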
ggml/src/ggml-cuda/solve_tri.cuh
Outdated
@@ -0,0 +1,5 @@
#include "common.cuh"

#define CUDA_SOLVE_TRI_BLOCK_SIZE 256
The kernel does not respect this define. More generally, you are launching a kernel with up to 32*32=1024 threads which is in principle still possible but becomes problematic in terms of register pressure. My recommendation would be to launch at most 256 threads, to specify this upper limit via __launch_bounds__, and to handle the cases which currently use > 8 warps with a loop.
I think the reason you needed to add a guard is that the number of warps can be != a power of 2.
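A rough sketch of that suggestion, purely for illustration (the block shape, kernel name, and column loop are assumptions, not the final PR code):

#ifndef CUDA_SOLVE_TRI_BLOCK_SIZE
#define CUDA_SOLVE_TRI_BLOCK_SIZE 256  // 8 warps of 32 threads, as in solve_tri.cuh
#endif

// Cap the block at 8 warps; each warp walks over the right-hand-side columns
// in steps of blockDim.y, so k > 8 no longer needs more than 8 warps.
static __global__ void __launch_bounds__(CUDA_SOLVE_TRI_BLOCK_SIZE)
solve_tri_f32_capped(const float * __restrict__ A,
                     const float * __restrict__ B,
                     float       * __restrict__ X,
                     int n, int k) {
    const int lane = threadIdx.x;  // 0..31
    const int warp = threadIdx.y;  // 0..blockDim.y-1

    for (int col = warp; col < k; col += blockDim.y) {
        // ... per-column forward substitution as in the kernel above,
        //     using `lane` for the warp-level partial dot products ...
    }
}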
|
Have a look at #17486; on my system the perf is >2x better than this CUDA kernel. There's probably still some room for improvement when K is small.
And here I thought I could get away with it easily ;) Oh well, guess it's time to learn how to use Nsight.
|
I managed to get a further ~30% improvement by removing the division-by-zero guard (it doesn't make much sense there anyway). Now I'm in this bizarre state where my 3080 is twice as slow as my 5060 and I don't know why:
Device description: NVIDIA GeForce RTX 3080
Device memory: 9871 MB (11009 MB free)
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 65520 runs - 15.42 us/run - 152 kB/run - 9.40 GB/s

Device description: NVIDIA GeForce RTX 5060 Ti
Device memory: 15848 MB (11091 MB free)
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 131040 runs - 7.87 us/run - 152 kB/run - 18.41 GB/s

Any idea what could be the reason?
|
Got another 20% by moving the second sync outside of the loop, but the difference between the 3080 and 5060 persists.
|
I've found some more optimizations for your kernel which reliably improved performance by a good amount on both my GPUs.
Current PR:
Device description: NVIDIA GeForce RTX 4070 Ti
Device memory: 12281 MB (11036 MB free)
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 90090 runs - 11.26 us/run - 152 kB/run - 12.87 GB/s
Device description: NVIDIA GeForce RTX 2070
Device memory: 8191 MB (7144 MB free)
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 57330 runs - 18.01 us/run - 152 kB/run - 8.05 GB/s
My improved kernel:
Device description: NVIDIA GeForce RTX 4070 Ti
Device memory: 12281 MB (11036 MB free)
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 147420 runs - 6.93 us/run - 152 kB/run - 20.91 GB/s
Device description: NVIDIA GeForce RTX 2070
Device memory: 8191 MB (7144 MB free)
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 81900 runs - 13.28 us/run - 152 kB/run - 10.91 GB/s
I've run more than 25 runs and the gain is consistent. Here are the proposed changes, all in solve_tri.cu:
@@ -47,7 +47,8 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
float * X_batch = (float *) ((char *) X + i02 * nb2 + i03 * nb3);
__shared__ float sA[MAX_N_FAST * MAX_N_FAST];
- __shared__ float sX[MAX_N_FAST * (MAX_K_FAST + 1)];
+ __shared__ float sX[MAX_N_FAST * MAX_K_FAST];
+ __shared__ float sXt[MAX_K_FAST * MAX_N_FAST];
const int offset = threadIdx.x + threadIdx.y * blockDim.x;
@@ -60,7 +61,7 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
}
#pragma unroll
- for (int i = 0; i < n * (k + 1); i += k * WARP_SIZE) {
+ for (int i = 0; i < n * k; i += k * WARP_SIZE) {
int i0 = i + offset;
if (i0 < n * k) {
sX[i0] = B_batch[i0];
@@ -69,22 +70,52 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
__syncthreads();
+ // Transpose sX (row-major) to sXt (column-major per column)
+#pragma unroll 2
+ for (int rr = 0; rr < 2; ++rr) {
+ int row = rr * WARP_SIZE + lane;
+ if (row < n) {
+ sXt[col_idx * n + row] = sX[row * k + col_idx];
+ }
+ }
+
+ __syncthreads();
+
+ // Forward substitution
#pragma unroll
for (int row = 0; row < n; ++row) {
float sum = 0.0f;
-#pragma unroll
- for (int j = lane; j < row; j += WARP_SIZE) {
- sum += sA[row * n + j] * sX[j * k + col_idx];
+ // First warp
+ {
+ int j = lane;
+ if (j < row) {
+ sum += sA[row * n + j] * sXt[col_idx * n + j];
+ }
+ }
+ // Second warp
+ if (row >= WARP_SIZE) {
+ int j = WARP_SIZE + lane;
+ if (j < row) {
+ sum += sA[row * n + j] * sXt[col_idx * n + j];
+ }
}
sum = warp_reduce_sum(sum);
if (lane == 0) {
- const float b_val = sX[row * k + col_idx]; // Value from B
- const float a_diag = sA[row * n + row];
- // no safeguards for division by zero because that indicates corrupt data anyway
- sX[row * k + col_idx] = (b_val - sum) / a_diag;
+ const float diag = sA[row * n + row];
+ const float b_val = sXt[col_idx * n + row];
+ sXt[col_idx * n + row] = (b_val - sum) / diag;
+ }
+ }
+
+ // Transpose back sXt to sX (row-major)
+#pragma unroll 2
+ for (int rr = 0; rr < 2; ++rr) {
+ int row = rr * WARP_SIZE + lane;
+ if (row < n) {
+ sX[row * k + col_idx] = sXt[col_idx * n + row];
}
}
As for how I was able to improve the kernel: I created a new framework for OpenEvolve that uses Nsight to improve the kernel via evolution and uses llama.cpp itself as the benchmark, and I got this kernel after only 6 iterations. I hope this helps!
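For anyone wondering why the transpose helps: j comes from the lane index, so with the row-major sX the 32 lanes of a warp read addresses k floats apart, which causes shared-memory bank conflicts whenever k shares a factor with the 32 banks (any even k), whereas with sXt they read consecutive floats. The two access patterns from the diff, side by side:

// before: stride of k floats between neighbouring lanes (j = lane)
sum += sA[row * n + j] * sX[j * k + col_idx];
// after: stride of 1 float between neighbouring lanes, conflict-free
sum += sA[row * n + j] * sXt[col_idx * n + j];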
|
@wsbagnsv1 nice one, I was literally going to do the transpose trick as the next optimization :) |
|
@wsbagnsv1 BTW, care to upload your OpenEvolve evaluator somewhere?
|
@wsbagnsv1 BTW it's always good to look at the generated kernels as well :) Your kernel has completely unneeded writes to an extra array - you only write to |
|
@JohannesGaessler @am17an I think this is as far as I can push this, I believe it's already pretty well optimized. |
|
Nice! I think you need another clang-format and preferably pass the stride in units of float rather than bytes |
|
Cleaned it up. Final results: |
am17an
left a comment
Just a few nitpicks, can merge after that.
Yeah, I have to clean it up a bit and then I'll put it on my GitHub; I had to hardcode the paths for the compilation etc. 😅
Probably because it was a rather early iteration; I'll check the later ones once I'm back on my PC. But yeah, you're right. I didn't have much time left before sleep, saw that massive improvement, and basically just copy-pasted it so it's at least online 😅
|
Perf on my system: |
|
my own results 4060Ti and 3060 performance
Curiously, it seems there isn't much scaling within a given GPU family...
|
@theo77186 it's a small kernel that doesn't utilize all the cores, thus no scaling. It would probably scale better with a tiled version.
I've just created my fork of OpenEvolve and uploaded my changes with the solve_tri kernel. Maybe it can be used to find a properly optimized tiled version of this kernel too, if we even need it, though we'd have to change the parser in the evaluator and the config.yaml. But that's for another PR, I guess 😉
|
@am17an Aight, CI tests are fine, think we can merge. |
float * X_batch = (float *) (X + i02 * nb2 / sizeof(float) + i03 * nb3 / sizeof(float));

__shared__ float sA[MAX_N_FAST * MAX_N_FAST];
__shared__ float sXt[MAX_N_FAST * (MAX_K_FAST + 1)];
Simply changing the size of a 1D allocation like this does nothing to fix shared memory bank conflicts. You have to actually access elements with the padded stride. One way to do this automatically is to change the array shape to be 2D and to pad the last dimension.
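A sketch of what that 2D form could look like for the sXt array (the exact shape is an assumption based on its column-per-row layout in the diff, not the merged code):

// Inside the kernel, replacing the flat allocation: the padded last dimension
// means every sXt[col][row] access goes through a row stride of
// (MAX_N_FAST + 1) floats, so consecutive rows land in different banks.
__shared__ float sXt[MAX_K_FAST][MAX_N_FAST + 1];
// Accesses then pick up the padded stride automatically, e.g.:
//   sXt[col_idx][row] = (b_val - sum) / diag;
//   sum += sA[row * n + j] * sXt[col_idx][j];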
|
Sorry, I forgot to press the submit button on my review. It is one more optimization that could be done, though for any optimizations you should first check what percentage of the runtime the operation takes up in the first place (because that is the maximum percentage that you can shave off). |
I've managed to actually poke enough LLMs in the correct direction to end up with this:
CPU:
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 32760 runs - 37.34 us/run - 152 kB/run - 3.88 GB/s
CUDA:
SOLVE_TRI(type=f32,ne_lhs=[64,64,4,2],ne_rhs=[6,64,4,2]): 49140 runs - 22.52 us/run - 152 kB/run - 6.44 GB/s
This can most certainly be improved by someone who knows what they're doing, but at least it does the bare minimum by supplying a CUDA kernel that's around twice as fast as the optimized CPU implementation.