Commit 14de3f2

Merge pull request #40 from JuliaGPU/multithreaded-algorithms
Added multithreaded implementations of 1D and ND `accumulate` and `mapreduce`; removed the Polyester and OhMyThreads dependencies.
2 parents 111c89b + 32e7dbe commit 14de3f2

32 files changed: +919 −639 lines

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -25,3 +25,6 @@ Manifest.toml
 
 # Local environment files
 .vscode/settings.json
+
+# Profile files
+profile.pb.gz
```

Project.toml

Lines changed: 2 additions & 6 deletions
```diff
@@ -1,15 +1,13 @@
 name = "AcceleratedKernels"
 uuid = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
 authors = ["Andrei-Leonard Nicusan <leonard@evophase.co.uk> and contributors"]
-version = "0.3.4"
+version = "0.4.0"
 
 [deps]
 ArgCheck = "dce04be8-c92d-5529-be00-80e4d2c0e197"
 GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
 KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
 Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
-OhMyThreads = "67456a42-1dca-4109-a031-0a68de7e3ad5"
-Polyester = "f517fe37-dbe3-4b94-8317-1923a5111588"
 
 [weakdeps]
 Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
@@ -25,7 +23,5 @@ GPUArraysCore = "0.2.0"
 KernelAbstractions = "0.9.34"
 Markdown = "1"
 Metal = "1"
-OhMyThreads = "0.7, 0.8"
-Polyester = "0.7"
-julia = "1.10"
 oneAPI = "1, 2"
+julia = "1.10"
```

README.md

Lines changed: 43 additions & 1 deletion
```diff
@@ -190,6 +190,44 @@ Julia v1.11
 ## 1. What's Different?
 As far as I am aware, this is the first cross-architecture parallel standard library *from a unified codebase* - that is, the code is written as [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) backend-agnostic kernels, which are then **transpiled** to a GPU backend; that means we benefit from all the optimisations available on the native platform and official compiler stacks. For example, unlike open standards like OpenCL that require GPU vendors to implement that API for their hardware, we target the existing official compilers. And while performance-portability libraries like [Kokkos](https://github.com/kokkos/kokkos) and [RAJA](https://github.com/LLNL/RAJA) are powerful for large C++ codebases, they require US National Lab-level development and maintenance efforts to effectively forward calls from a single API to the separately-developed OpenMP, CUDA Thrust, ROCm rocThrust and oneAPI DPC++ libraries.
 
+As a simple example, this is how a normal Julia `for`-loop can be converted to an accelerated kernel - for both multithreaded CPUs and Nvidia / AMD / Intel / Apple GPUs, **with native performance** - by changing a single line:
+
+<table>
+<tr>
+<td> CPU code </td> <td> Multithreaded / GPU code </td>
+</tr>
+
+<tr>
+<td>
+
+```julia
+# Copy kernel testing throughput
+
+function cpu_copy!(dst, src)
+    for i in eachindex(src)
+        dst[i] = src[i]
+    end
+end
+```
+
+</td>
+<td>
+
+```julia
+import AcceleratedKernels as AK
+
+function ak_copy!(dst, src)
+    AK.foreachindex(src) do i
+        dst[i] = src[i]
+    end
+end
+```
+
+</td>
+</tr>
+</table>
+
+
 Again, this is only possible because of the unique Julia compilation model, the [JuliaGPU](https://juliagpu.org/) organisation's work on reusable GPU backend infrastructure, and especially the [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) backend-agnostic kernel language. Thank you.
 
 
```
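For readers new to the pattern above: the two functions are called identically, and the backend is selected purely by the array type passed in. A minimal usage sketch - the Metal lines are an illustrative assumption, commented out, requiring Metal.jl on Apple hardware; CUDA.jl / AMDGPU.jl / oneAPI.jl arrays work the same way:

```julia
import AcceleratedKernels as AK

function ak_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end

# CPU: multithreaded if Julia was started with --threads=N
src = rand(Float32, 100_000)
dst = similar(src)
ak_copy!(dst, src)
@assert dst == src

# Apple GPU: the same function, dispatched on the array type
# using Metal
# src_gpu = MtlArray(src)
# dst_gpu = similar(src_gpu)
# ak_copy!(dst_gpu, src_gpu)
# @assert Array(dst_gpu) == src
```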

```diff
@@ -299,6 +337,11 @@ Leave out to test the CPU backend:
 $> julia -e 'import Pkg; Pkg.test("AcceleratedKernels")'
 ```
 
+Start Julia with multiple threads to run the tests on the multithreaded CPU backend:
+```bash
+$> julia --threads=4 -e 'import Pkg; Pkg.test("AcceleratedKernels")'
+```
+
 
 ## 8. Issues and Debugging
 As the compilation pipeline of GPU kernels is different to that of base Julia, error messages also look different - for example, where Julia would insert an exception when a variable name was not defined (e.g. we had a typo), a GPU kernel throwing exceptions cannot be compiled and instead you'll see some cascading errors like `"[...] compiling [...] resulted in invalid LLVM IR"` caused by `"Reason: unsupported use of an undefined name"` resulting in `"Reason: unsupported dynamic function invocation"`, etc.
```
```diff
@@ -322,7 +365,6 @@ Help is very welcome for any of the below:
   switch_below=(1, 10, 100, 1000, 10000)
   end
   ```
-- We need multithreaded implementations of `sort`, N-dimensional `mapreduce` (in `OhMyThreads.tmapreduce`) and `accumulate` (again, probably in `OhMyThreads`).
 - Any way to expose the warp-size from the backends? Would be useful in reductions.
 - Add a performance regressions runner.
 - **Other ideas?** Post an issue, or open a discussion on the Julia Discourse.
```

docs/src/api/using_backends.md

Lines changed: 1 addition & 3 deletions
```diff
@@ -30,6 +30,4 @@ v = Vector(-1000:1000) # Normal CPU array
 AK.reduce(+, v, max_tasks=Threads.nthreads())
 ```
 
-Note the `reduce` and `mapreduce` CPU implementations forward arguments to [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl), an excellent package for multithreading. The focus of AcceleratedKernels.jl is to provide a unified interface to high-performance implementations of common algorithmic kernels, for both CPUs and GPUs - if you need fine-grained control over threads, scheduling, communication for specialised algorithms (e.g. with highly unequal workloads), consider using [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl) or [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) directly.
-
-There is ongoing work on multithreaded CPU `sort` and `accumulate` implementations - at the moment, they fall back to single-threaded algorithms; the rest of the library is fully parallelised for both CPUs and GPUs.
+By default all algorithms use the number of threads Julia was started with.
```
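For context on that new docs line, a small sketch of the thread-control keywords touched by this commit - assuming `max_tasks` caps the number of tasks spawned and `min_elems` sets a minimum number of elements per task, as the "CPU settings" signatures in the Metal extension below suggest:

```julia
import AcceleratedKernels as AK

v = Vector(-1000:1000)   # Normal CPU array

# Cap the CPU backend at 2 tasks regardless of --threads:
AK.reduce(+, v, max_tasks=2)

# With min_elems=10_000, this 2001-element reduction stays on one task,
# avoiding task-spawning overhead for small workloads:
AK.reduce(+, v, max_tasks=Threads.nthreads(), min_elems=10_000)
```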

ext/AcceleratedKernelsMetalExt.jl

Lines changed: 16 additions & 0 deletions
```diff
@@ -14,6 +14,10 @@ function AK.accumulate!(
     dims::Union{Nothing, Int}=nothing,
     inclusive::Bool=true,
 
+    # CPU settings - not used
+    max_tasks::Int=Threads.nthreads(),
+    min_elems::Int=1,
+
     # Algorithm choice
     alg::AK.AccumulateAlgorithm=AK.ScanPrefixes(),
 
@@ -39,6 +43,10 @@ function AK.accumulate!(
     dims::Union{Nothing, Int}=nothing,
     inclusive::Bool=true,
 
+    # CPU settings - not used
+    max_tasks::Int=Threads.nthreads(),
+    min_elems::Int=1,
+
     # Algorithm choice
     alg::AK.AccumulateAlgorithm=AK.ScanPrefixes(),
 
@@ -63,6 +71,10 @@ function AK.cumsum(
     neutral=zero(eltype(src)),
     dims::Union{Nothing, Int}=nothing,
 
+    # CPU settings - not used
+    max_tasks::Int=Threads.nthreads(),
+    min_elems::Int=1,
+
     # Algorithm choice
     alg::AK.AccumulateAlgorithm=AK.ScanPrefixes(),
 
@@ -93,6 +105,10 @@ function AK.cumprod(
     neutral=one(eltype(src)),
     dims::Union{Nothing, Int}=nothing,
 
+    # CPU settings - not used
+    max_tasks::Int=Threads.nthreads(),
+    min_elems::Int=1,
+
     # Algorithm choice
     alg::AK.AccumulateAlgorithm=AK.ScanPrefixes(),
 
```
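The effect of accepting (and ignoring) these CPU-only keywords on the GPU methods is that one call site can serve both backends. A minimal sketch, assuming the `AK.cumsum` keywords shown above keep their CPU meaning on the default backend; the Metal lines are commented out and require Metal.jl on Apple hardware:

```julia
import AcceleratedKernels as AK

# One call site, two backends: the CPU path uses the threading keywords,
# while the Metal methods above accept and discard them.
running_sum(v) = AK.cumsum(v; max_tasks=Threads.nthreads(), min_elems=1_000)

v = rand(Float32, 10_000)
@assert running_sum(v) ≈ cumsum(v)     # multithreaded CPU

# using Metal
# r_gpu = running_sum(MtlArray(v))     # the same call on an Apple GPU
```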

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+[deps]
+AcceleratedKernels = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
+BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
```
Lines changed: 30 additions & 0 deletions
```diff
@@ -0,0 +1,30 @@
+
+import AcceleratedKernels as AK
+using BenchmarkTools
+
+
+
+v = rand(1_000_000)
+init = eltype(v)(0)
+
+r1 = Base.accumulate(+, v; init=init)
+r2 = AK.accumulate(+, v; init=init)
+
+@assert r1 == r2
+
+
+v = rand(1_000_000)
+init = eltype(v)(0)
+
+println("1D Benchmark - Base vs. AK")
+display(@benchmark Base.accumulate(+, v; init=init))
+display(@benchmark AK.accumulate(+, v; init=init))
+
+
+v = rand(100, 100, 100)
+init = eltype(v)(0)
+
+println("3D Benchmark - Base vs. AK")
+display(@benchmark Base.accumulate(+, v; init=init, dims=2))
+display(@benchmark AK.accumulate(+, v; init=init, dims=2))
+
```
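One caveat if you reproduce these numbers: `v` and `init` are untyped globals, and BenchmarkTools recommends interpolating such variables with `$` so the timing measures the call itself rather than global-variable access. A variant of the 1D comparison following that guideline:

```julia
import AcceleratedKernels as AK
using BenchmarkTools

v = rand(1_000_000)
init = eltype(v)(0)

# Interpolating with `$` embeds the values in the benchmark expression,
# removing dynamic dispatch on untyped globals from the measurement:
display(@benchmark Base.accumulate(+, $v; init=$init))
display(@benchmark AK.accumulate(+, $v; init=$init))
```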
Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+[deps]
+AcceleratedKernels = "6a4ca0a5-0e36-4168-a932-d9be78d558f1"
+OhMyThreads = "67456a42-1dca-4109-a031-0a68de7e3ad5"
```
Lines changed: 24 additions & 0 deletions
```diff
@@ -0,0 +1,24 @@
+
+import AcceleratedKernels as AK
+using BenchmarkTools
+
+
+v = rand(1_000_000)
+f(x) = x^2
+op(x, y) = x + y
+init = eltype(v)(0)
+
+println("1D Benchmark - Base vs. AK")
+display(@benchmark Base.mapreduce(f, op, v; init=init))
+display(@benchmark AK.mapreduce(f, op, v; init=init, neutral=init))
+
+
+v = rand(100, 100, 100)
+f(x) = x^2
+op(x, y) = x + y
+init = eltype(v)(0)
+
+println("3D Benchmark - Base vs. AK")
+display(@benchmark Base.mapreduce(f, op, v; init=init, dims=2))
+display(@benchmark AK.mapreduce(f, op, v; init=init, neutral=init, dims=2))
+
```
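A note on the `neutral` keyword passed alongside `init`: in AcceleratedKernels it is the neutral element of `op` (a value with `op(neutral, x) == x`) used to pad partial reductions, while `init` seeds the final result - here both are zero because the operator is `+`. A minimal sketch with a different operator, under that same assumption about the keyword's meaning:

```julia
import AcceleratedKernels as AK

v = rand(1_000_000)

# For `min`, the neutral element is typemax: padding with it can never
# change the result of the reduction.
m = AK.mapreduce(abs, min, v;
                 init=typemax(eltype(v)),
                 neutral=typemax(eltype(v)))
@assert m == minimum(abs, v)
```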
Lines changed: 38 additions & 0 deletions
```diff
@@ -0,0 +1,38 @@
+
+import AcceleratedKernels as AK
+import OhMyThreads as OMT
+using BenchmarkTools
+
+
+# Turns out we have the same performance as tmapreduce with just AK base threading
+function mapreduce_omt(f, op, v; init)
+    # MapReduce using OhMyThreads
+    return OMT.tmapreduce(f, op, v; init=init)
+end
+
+
+function mapreduce_ak(f, op, v; init, max_tasks=Threads.nthreads())
+    # MapReduce using AcceleratedKernels
+    if max_tasks == 1
+        return Base.mapreduce(f, op, v; init=init)
+    end
+
+    shared = Vector{typeof(init)}(undef, max_tasks)
+    AK.itask_partition(length(v), max_tasks) do itask, irange
+        @inbounds begin
+            shared[itask] = Base.mapreduce(f, op, @view(v[irange]); init=init)
+        end
+    end
+    return Base.reduce(op, shared; init=init)
+end
+
+
+v = rand(1_000_000)
+f(x) = x^2
+op(x, y) = x + y
+init = eltype(v)(0)
+
+@assert mapreduce_omt(f, op, v; init=init) == mapreduce_ak(f, op, v; init=init)
+
+display(@benchmark mapreduce_omt(f, op, v; init=init))
+display(@benchmark mapreduce_ak(f, op, v; init=init))
```
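The `AK.itask_partition` helper used in `mapreduce_ak` is a general building block: it splits the index range `1:length(v)` into at most `max_tasks` contiguous chunks and invokes the closure once per task. A minimal sketch reusing the same `(itask, irange)` pattern for a plain threaded sum, assuming the helper keeps the call shape shown above:

```julia
import AcceleratedKernels as AK

function threaded_sum(v; max_tasks=Threads.nthreads())
    max_tasks == 1 && return sum(v)
    partials = Vector{eltype(v)}(undef, max_tasks)
    AK.itask_partition(length(v), max_tasks) do itask, irange
        # Each task reduces its own contiguous chunk into a separate slot,
        # so no synchronisation between tasks is needed...
        partials[itask] = sum(@view v[irange])
    end
    # ...and the per-task partial sums are combined serially at the end.
    return sum(partials)
end

v = rand(1_000_000)
@assert threaded_sum(v) ≈ sum(v)   # ≈ because float addition order differs
```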
