Commit 49751f2

Remove prefer_threads from docstrings

1 parent: 2d86364

8 files changed: +8, -38 lines

src/accumulate/accumulate.jl

Lines changed: 1 addition & 5 deletions
@@ -38,7 +38,6 @@ include("accumulate_nd.jl")
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=2,
-    prefer_threads::Bool=true,
 
     # Algorithm choice
     alg::AccumulateAlgorithm=DecoupledLookback(),
@@ -59,7 +58,6 @@ include("accumulate_nd.jl")
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=2,
-    prefer_threads::Bool=true,
 
     # Algorithm choice
     alg::AccumulateAlgorithm=DecoupledLookback(),
@@ -82,9 +80,7 @@ we do not need the constraint of `dst` and `src` being different; to minimise me
 recommend using the single-array interface (the first one above).
 
 ## CPU
-Use at most `max_tasks` threads with at least `min_elems` elements per task. `prefer_threads` tells
-AK to prioritize using the CPU algorithm implementation (default behaviour) over the KA algorithm
-through POCL.
+Use at most `max_tasks` threads with at least `min_elems` elements per task.
 
 Note that accumulation is typically a memory-bound operation, so multithreaded accumulation only
 becomes faster if it is a more compute-heavy operation to hide memory latency - that includes:
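
For reference, the CPU keywords kept in this docstring are passed as in the minimal sketch below - an illustrative call assuming the `AK.accumulate` entry point whose signature is edited above; the keyword values are placeholders, not recommendations.

    import AcceleratedKernels as AK

    x = rand(Float32, 1_000_000)

    # Inclusive prefix sum on the CPU; `init` is the neutral element for `+`.
    # `max_tasks` and `min_elems` are the CPU tuning knobs kept in the docstring;
    # the values here are illustrative only.
    y = AK.accumulate(+, x; init=0.0f0, max_tasks=4, min_elems=10_000)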

src/foreachindex.jl

Lines changed: 2 additions & 6 deletions
@@ -47,7 +47,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size=256,
@@ -61,8 +60,7 @@ MtlArray, oneArray - with one GPU thread per index.
 On CPUs at most `max_tasks` threads are launched, or fewer such that each thread processes at least
 `min_elems` indices; if a single task ends up being needed, `f` is inlined and no thread is
 launched. Tune it to your function - the more expensive it is, the fewer elements are needed to
-amortise the cost of launching a thread (which is a few μs). `prefer_threads` tells AK to prioritize
-using the CPU algorithm implementation (default behaviour) over the KA algorithm through POCL.
+amortise the cost of launching a thread (which is a few μs).
 
 # Examples
 Normally you would write a for loop like this:
@@ -147,7 +145,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size=256,
@@ -161,8 +158,7 @@ MtlArray, oneArray - with one GPU thread per index.
 On CPUs at most `max_tasks` threads are launched, or fewer such that each thread processes at least
 `min_elems` indices; if a single task ends up being needed, `f` is inlined and no thread is
 launched. Tune it to your function - the more expensive it is, the fewer elements are needed to
-amortise the cost of launching a thread (which is a few μs). `prefer_threads` tells AK to prioritize
-using the CPU algorithm implementation (default behaviour) over the KA algorithm through POCL.
+amortise the cost of launching a thread (which is a few μs).
 
 # Examples
 Normally you would write a for loop like this:
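
The do-block form this docstring refers to looks like the minimal sketch below; the array and loop body are made up for illustration, while `max_tasks` and `min_elems` are the CPU keywords retained above.

    import AcceleratedKernels as AK

    v = zeros(Int, 100_000)

    # One parallel iteration per index: on the CPU at most `max_tasks` threads are
    # launched, each handling at least `min_elems` indices; values are illustrative.
    AK.foreachindex(v; max_tasks=Threads.nthreads(), min_elems=1_000) do i
        v[i] = i * i
    end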

src/map.jl

Lines changed: 0 additions & 2 deletions
@@ -5,7 +5,6 @@
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size=256,
@@ -54,7 +53,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size=256,

src/predicates.jl

Lines changed: 2 additions & 8 deletions
@@ -39,7 +39,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -54,9 +53,7 @@ reduction.
 ## CPU
 Multithreaded parallelisation is only worth it for large arrays, relatively expensive predicates,
 and/or rare occurrence of true; use `max_tasks` and `min_elems` to only use parallelism when worth
-it in your application. When only one thread is needed, there is no overhead. `prefer_threads`
-tells AK to prioritize using the CPU algorithm implementation (default behaviour) over the KA
-algorithm through POCL.
+it in your application. When only one thread is needed, there is no overhead.
 
 ## GPU
 There are two possible `alg` choices:
@@ -176,7 +173,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -191,9 +187,7 @@ reduction.
 ## CPU
 Multithreaded parallelisation is only worth it for large arrays, relatively expensive predicates,
 and/or rare occurrence of true; use `max_tasks` and `min_elems` to only use parallelism when worth
-it in your application. When only one thread is needed, there is no overhead. `prefer_threads`
-tells AK to prioritize using the CPU algorithm implementation (default behaviour) over the KA
-algorithm through POCL.
+it in your application. When only one thread is needed, there is no overhead.
 
 ## GPU
 There are two possible `alg` choices:
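
As a usage sketch of the retained CPU keywords - illustrative values, assuming the `AK.any`/`AK.all` predicates whose signatures are edited above:

    import AcceleratedKernels as AK

    v = rand(Float32, 1_000_000)

    # Predicate checks over a large array; parallelism only pays off for large inputs
    # or expensive predicates, hence the high `min_elems` (values are illustrative).
    found  = AK.any(x -> x > 0.999f0, v; max_tasks=Threads.nthreads(), min_elems=100_000)
    all_ok = AK.all(x -> x >= 0.0f0, v; max_tasks=Threads.nthreads(), min_elems=100_000)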

src/reduce/reduce.jl

Lines changed: 2 additions & 6 deletions
@@ -15,7 +15,6 @@ include("mapreduce_nd.jl")
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -32,8 +31,7 @@ The returned type is the same as `init` - to control output precision, specify `
 ## CPU settings
 Use at most `max_tasks` threads with at least `min_elems` elements per task. For N-dimensional
 arrays (`dims::Int`) multithreading currently only becomes faster for `max_tasks >= 4`; all other
-cases are scaling linearly with the number of threads. `prefer_threads` tells AK to prioritize
-using the CPU algorithm implementation (default behaviour) over the KA algorithm through POCL.
+cases are scaling linearly with the number of threads.
 
 Note that multithreading reductions only improves performance for cases with more compute-heavy
 operations, which hide the memory latency and thread launch overhead - that includes:
@@ -100,7 +98,6 @@ end
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -120,8 +117,7 @@ The returned type is the same as `init` - to control output precision, specify `
 ## CPU settings
 Use at most `max_tasks` threads with at least `min_elems` elements per task. For N-dimensional
 arrays (`dims::Int`) multithreading currently only becomes faster for `max_tasks >= 4`; all other
-cases are scaling linearly with the number of threads. `prefer_threads` tells AK to prioritize
-using the CPU algorithm implementation (default behaviour) over the KA algorithm through POCL.
+cases are scaling linearly with the number of threads.
 
 ## GPU settings
 The `block_size` parameter controls the number of threads per block.
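
A minimal sketch of the reduction calls these docstrings describe - assuming the `AK.reduce` entry point edited above, with illustrative keyword values:

    import AcceleratedKernels as AK

    x = rand(Float32, 10_000, 1_000)

    # Whole-array sum; the returned type follows `init`, as the docstring notes.
    s = AK.reduce(+, x; init=0.0f0, max_tasks=Threads.nthreads(), min_elems=10_000)

    # Reduction along one dimension; per the docstring, multithreading the
    # N-dimensional case only pays off for `max_tasks >= 4`. Values are illustrative.
    col_sums = AK.reduce(+, x; init=0.0f0, dims=1, max_tasks=8)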

src/searchsorted.jl

Lines changed: 0 additions & 4 deletions
@@ -80,7 +80,6 @@ end
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=1000,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -129,7 +128,6 @@ end
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=1000,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -165,7 +163,6 @@ end
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=1000,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -214,7 +211,6 @@ end
     # CPU settings
     max_tasks::Int=Threads.nthreads(),
     min_elems::Int=1000,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,

src/sort/sort.jl

Lines changed: 0 additions & 6 deletions
@@ -21,7 +21,6 @@ include("cpu_sample_sort.jl")
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -37,8 +36,6 @@ arguments are the same as for `Base.sort`.
 CPU settings: use at most `max_tasks` threads to sort the array such that at least `min_elems`
 elements are sorted by each thread. A parallel [`sample_sort!`](@ref) is used, processing
 independent slices of the array and deferring to `Base.sort!` for the final local sorts.
-`prefer_threads` tells AK to prioritize using the CPU algorithm implementation (default behaviour)
-over the KA algorithm through POCL.
 
 Note that the Base Julia `sort!` is mainly memory-bound, so multithreaded sorting only becomes
 faster if it is a more compute-heavy operation to hide memory latency - that includes:
@@ -129,7 +126,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -166,7 +162,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
@@ -243,7 +238,6 @@ end
     # CPU settings
     max_tasks=Threads.nthreads(),
     min_elems=1,
-    prefer_threads::Bool=true,
 
     # GPU settings
     block_size::Int=256,
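
A minimal sketch of the CPU path described above - assuming the `AK.sort!` entry point whose signature is edited in this file, with illustrative keyword values:

    import AcceleratedKernels as AK

    v = rand(Float32, 1_000_000)

    # Parallel sample sort on the CPU: at most `max_tasks` threads, each sorting a
    # slice of at least `min_elems` elements, deferring to `Base.sort!` locally.
    AK.sort!(v; max_tasks=Threads.nthreads(), min_elems=100_000)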

test/runtests.jl

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ include("partition.jl")
 include("looping.jl")
 include("map.jl")
 include("sort.jl")
-include("reduce.jl")
+prefer_threads && include("reduce.jl") # Reduce is very broken when using the KA CPU backend
 include("accumulate.jl")
 include("predicates.jl")
 include("binarysearch.jl")
