@@ -40,7 +40,7 @@ include("accumulate_nd.jl")
     min_elems::Int=2,

     # Algorithm choice
-    alg::AccumulateAlgorithm=DecoupledLookback(),
+    alg::AccumulateAlgorithm=ScanPrefixes(),

     # GPU settings
     block_size::Int=256,
@@ -60,7 +60,7 @@ include("accumulate_nd.jl")
     min_elems::Int=2,

     # Algorithm choice
-    alg::AccumulateAlgorithm=DecoupledLookback(),
+    alg::AccumulateAlgorithm=ScanPrefixes(),

     # GPU settings
     block_size::Int=256,
@@ -89,13 +89,13 @@ becomes faster if it is a more compute-heavy operation to hide memory latency -

 ## GPU
 For the 1D case (`dims=nothing`), the `alg` can be one of the following:
-- `DecoupledLookback()`: the default algorithm, using opportunistic lookback to reuse earlier
-  blocks' results; requires device-level memory consistency guarantees, which Apple Metal does not
-  provide.
-- `ScanPrefixes()`: a simpler algorithm that scans the prefixes of each block, with no lookback; it
-  has similar performance as `DecoupledLookback()` for large block sizes, and small to medium arrays,
+- `ScanPrefixes()`: the default algorithm that scans the prefixes of each block, with no lookback; it
+  has better performance than `DecoupledLookback()` for large block sizes, and small to medium arrays,
   but poorer scaling for many blocks; there is no performance degradation below `block_size^2`
-  elements.
+  elements, but it remains fast well into millions of elements.
+- `DecoupledLookback()`: a more complex algorithm using opportunistic lookback to reuse earlier
+  blocks' results; requires device-level memory consistency guarantees (which Apple Metal does not
+  provide) and atomic orderings; theoretically more scalable for many blocks.

 A different, unique algorithm is used for the multi-dimensional case (`dims` is an integer).

@@ -105,13 +105,7 @@ The temporaries are only used for the 1D case (`dims=nothing`): `temp` stores pe
 `temp_flags` is only used for the `DecoupledLookback()` algorithm for flagging if blocks are ready;
 they should both have at least `(length(v) + 2 * block_size - 1) ÷ (2 * block_size)` elements; also,
 `eltype(v) === eltype(temp)` is required; the elements in `temp_flags` can be any integers, but
-`Int8` is used by default to reduce memory usage.
-
-# Platform-Specific Notes
-On Metal, the `alg=ScanPrefixes()` algorithm is used by default, as Apple Metal GPUs do not have
-strong enough memory consistency guarantees for the `DecoupledLookback()` algorithm - which
-produces incorrect results about 0.38% of the time (the beauty of parallel algorithms, ey). Also,
-`block_size=1024` is used here by default to reduce the number of coupled lookbacks.
+`UInt8` is used by default to reduce memory usage.

 # Examples
 Example computing an inclusive prefix sum (the typical GPU "scan"):
@@ -123,7 +117,7 @@ v = oneAPI.ones(Int32, 100_000)
 AK.accumulate!(+, v, init=0)

 # Use a different algorithm
-AK.accumulate!(+, v, alg=AK.ScanPrefixes())
+AK.accumulate!(+, v, alg=AK.DecoupledLookback())
 ```
 """
 function accumulate!(
@@ -160,8 +154,6 @@ function _accumulate_impl!(
     dims::Union{Nothing, Int}=nothing,
     inclusive::Bool=true,

-    # FIXME: Switch back to `DecoupledLookback()` as the default algorithm
-    # once https://github.com/JuliaGPU/AcceleratedKernels.jl/pull/44 is merged.
     alg::AccumulateAlgorithm=ScanPrefixes(),

     # CPU settings
@@ -214,7 +206,7 @@
     min_elems::Int=2,

     # Algorithm choice
-    alg::AccumulateAlgorithm=DecoupledLookback(),
+    alg::AccumulateAlgorithm=ScanPrefixes(),

     # GPU settings
     block_size::Int=256,