README.md (43 additions, 1 deletion)
@@ -190,6 +190,44 @@ Julia v1.11
## 1. What's Different?

As far as I am aware, this is the first cross-architecture parallel standard library built *from a unified codebase* - that is, the algorithms are written as [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) backend-agnostic kernels, which are then **transpiled** to each GPU backend, so we benefit from all the optimisations of the native platforms and official compiler stacks. For example, unlike open standards such as OpenCL, which require GPU vendors to implement the API for their hardware, we target the existing official compilers. And while performance-portability libraries like [Kokkos](https://github.com/kokkos/kokkos) and [RAJA](https://github.com/LLNL/RAJA) are powerful for large C++ codebases, they require US National Lab-level development and maintenance effort to forward calls from a single API to separately developed OpenMP, CUDA Thrust, ROCm rocThrust and oneAPI DPC++ libraries.

As a simple example, this is how a normal Julia `for`-loop can be converted to an accelerated kernel - for both multithreaded CPUs and Nvidia / AMD / Intel / Apple GPUs, **with native performance** - by changing a single line:

<table>
<tr>
<td> CPU Code </td> <td> Multithreaded / GPU code </td>
</tr>

<tr>
<td>

```julia
# Copy kernel testing throughput

function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end
```

</td>
<td>

```julia
import AcceleratedKernels as AK

function ak_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end
```

</td>
</tr>
</table>

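The right-hand `ak_copy!` can then be called on whichever array type matches the hardware. A minimal usage sketch - it assumes CUDA.jl is installed and an Nvidia GPU is available; AMDGPU.jl, oneAPI.jl or Metal.jl arrays work the same way:

```julia
import AcceleratedKernels as AK
using CUDA    # assumption: Nvidia GPU backend

# uses ak_copy! as defined above
h_src = rand(Float32, 1_000_000)    # host array
h_dst = similar(h_src)
ak_copy!(h_dst, h_src)              # multithreaded CPU execution

d_src = CuArray(h_src)              # device array on the GPU
d_dst = similar(d_src)
ak_copy!(d_dst, d_src)              # same function, now compiled as a CUDA kernel
```
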
Again, this is only possible because of the unique Julia compilation model, the [JuliaGPU](https://juliagpu.org/) organisation's work on reusable GPU backend infrastructure, and especially the [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) backend-agnostic kernel language. Thank you.
@@ -299,6 +337,11 @@ Leave out to test the CPU backend:
$> julia -e 'import Pkg; Pkg.test("AcceleratedKernels")'
```

Start Julia with multiple threads to run the tests on a multithreaded CPU backend:
```bash
$> julia --threads=4 -e 'import Pkg; Pkg.test("AcceleratedKernels")'
```

## 8. Issues and Debugging

As the compilation pipeline for GPU kernels is different from that of base Julia, error messages also look different. For example, where base Julia would throw an exception when a variable name is not defined (e.g. because of a typo), a GPU kernel that would throw an exception cannot be compiled at all; instead you will see cascading errors like `"[...] compiling [...] resulted in invalid LLVM IR"` caused by `"Reason: unsupported use of an undefined name"`, followed by `"Reason: unsupported dynamic function invocation"`, etc.
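
As an illustration, here is a hypothetical sketch - both `scale!` and the misspelt `factr` are invented names - of the kind of typo that produces these cascading messages:

```julia
import AcceleratedKernels as AK

function scale!(dst, src, factor)
    AK.foreachindex(src) do i
        dst[i] = src[i] * factr    # typo: should be `factor`, so `factr` is undefined
    end
end

# On the CPU backend this fails at runtime with a plain `UndefVarError` for `factr`;
# on a GPU backend the kernel cannot be compiled, and the typo surfaces as the
# "invalid LLVM IR" / "unsupported use of an undefined name" errors described above.
```
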
@@ -322,7 +365,6 @@ Help is very welcome for any of the below:
switch_below=(1, 10, 100, 1000, 10000)
end
```

- We need multithreaded implementations of `sort`, N-dimensional `mapreduce` (in `OhMyThreads.tmapreduce`) and `accumulate` (again, probably in `OhMyThreads`).
- Any way to expose the warp size from the backends? It would be useful in reductions.
- Add a performance regression runner.
- **Other ideas?** Post an issue, or open a discussion on the Julia Discourse.
docs/src/api/using_backends.md (1 addition, 3 deletions)
@@ -30,6 +30,4 @@ v = Vector(-1000:1000) # Normal CPU array
AK.reduce(+, v, max_tasks=Threads.nthreads())
```

Note that the `reduce` and `mapreduce` CPU implementations forward their arguments to [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), an excellent multithreading package. The focus of AcceleratedKernels.jl is to provide a unified interface to high-performance implementations of common algorithmic kernels, for both CPUs and GPUs; if you need fine-grained control over threads, scheduling or communication for specialised algorithms (e.g. with highly unequal workloads), consider using [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl) or [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) directly.
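
For example, the reduction above can also be written directly with `OhMyThreads.tmapreduce` when full control over scheduling is needed - a hedged sketch, where `ntasks` is assumed to be the relevant OhMyThreads keyword:

```julia
import OhMyThreads

v = Vector(-1000:1000)

# Same multithreaded reduction, with the number of parallel tasks chosen explicitly
OhMyThreads.tmapreduce(identity, +, v; ntasks=Threads.nthreads())
```
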

There is ongoing work on multithreaded CPU `sort` and `accumulate` implementations - at the moment, they fall back to single-threaded algorithms; the rest of the library is fully parallelised for both CPUs and GPUs.

By default all algorithms use the number of threads Julia was started with.
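
As a small illustration - a sketch reusing the `max_tasks` keyword shown above, meant as an example of restricting parallelism for one call rather than a full API reference:

```julia
import AcceleratedKernels as AK

Threads.nthreads()              # the number of threads Julia was started with

v = Vector(-1000:1000)
AK.reduce(+, v, max_tasks=2)    # restrict this particular call to at most 2 tasks
```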