Commit 5814b4d
cuda: optimize SOLVE_TRI using registers and FMAF (ggml-org#17703)
* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling
- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
* Small cleanup
* Remove comments in solve_tri.cu
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Use const for variables in solve_tri.cu
* Replace fmaf with more readable code
* remove last fmaf
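The history above first introduces explicit `fmaf` calls in the reduction loop and then replaces them with plainer code. A minimal host-side sketch of the two styles follows. It assumes a row-major lower-triangular system solved by forward substitution, and the function names `solve_lower_fmaf` and `solve_lower_plain` are hypothetical, not taken from `solve_tri.cu`. With nvcc, where `--fmad=true` is the default, the plain `a * b + c` form is normally contracted into a fused multiply-add anyway, which is one plausible reason the explicit calls could be dropped without a performance cost.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: forward substitution for a row-major, lower-triangular
// n x n matrix A, solving A x = b. The inner reduction uses explicit fmaf.
std::vector<float> solve_lower_fmaf(const std::vector<float> & A,
                                    const std::vector<float> & b, int n) {
    std::vector<float> x(n);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < i; ++j) {
            sum = std::fmaf(A[i * n + j], x[j], sum);  // sum += A[i][j] * x[j], rounded once
        }
        x[i] = (b[i] - sum) / A[i * n + i];
    }
    return x;
}

// Same algorithm with the multiply-add written out; the compiler is free to
// fuse it into an FMA where its floating-point contraction settings allow.
std::vector<float> solve_lower_plain(const std::vector<float> & A,
                                     const std::vector<float> & b, int n) {
    std::vector<float> x(n);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < i; ++j) {
            sum += A[i * n + j] * x[j];
        }
        x[i] = (b[i] - sum) / A[i * n + i];
    }
    return x;
}
```

The two variants can differ in the last bit when contraction is disabled, since `fmaf` rounds once and a separate multiply and add round twice.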
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
1 parent 79d6189 commit 5814b4d
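The commit message describes passing strides in bytes and casting to `char *` before the addition. ggml tensors store their strides in bytes (the `nb` fields), so this keeps the kernel's indexing in the same units. The sketch below illustrates that pattern only; the helper name `batch_matrix` and its parameter names are assumptions made for illustration and do not come from the kernel itself.

```cpp
#include <cstddef>

// Hypothetical helper (not from the commit): locate one matrix inside a
// batched float tensor whose batch strides nb2 and nb3 are given in bytes.
static inline const float * batch_matrix(const float * base,
                                         size_t i02, size_t i03,
                                         size_t nb2, size_t nb3) {
    // Cast to char * first so the offset advances by bytes rather than by
    // sizeof(float) units, then cast back to the element type.
    const char * p = (const char *) base + i02 * nb2 + i03 * nb3;
    return (const float *) p;
}
```

For a contiguous tensor with 4 floats per matrix and 3 matrices along dimension 2, `nb2` would be `4 * sizeof(float)` and `nb3` would be `3 * nb2`, so `batch_matrix(buf, 2, 1, nb2, nb3)` points 20 floats past `buf`. Adding an element count directly to a `float *` with these byte strides would overshoot by a factor of `sizeof(float)`.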
1 file changed, with 28 additions and 36 deletions.