Commit 2efebd6

commit after merge

1 parent b747bb7 commit 2efebd6
File tree

2 files changed: +39 −10 lines

docs/src/lecture_10/lecture.md

Lines changed: 39 additions & 10 deletions
````diff
@@ -464,7 +464,9 @@ a[:a] === b[:a]
 end
 @everywhere show_secret()
-remotecall_fetch(g -> eval(:(g = $(g))), 2, g)
+for i in workers()
+    remotecall_fetch(g -> eval(:(g = $(g))), i, g)
+end
 @everywhere show_secret()
 ```
 which is implemented, along with other variants, in `ParallelDataTransfer.jl`, but in general this construct should be avoided.
````
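For readers who want to try the loop from the hunk above in isolation, here is a self-contained sketch; the concrete value `42`, the use of `Core.eval(Main, ...)`, and the printing are our additions for illustration, not part of the lecture:

```julia
using Distributed
addprocs(2)                      # two local worker processes

g = 42                           # master-side value the workers should see

for i in workers()
    # build the expression `g = 42` and evaluate it in Main on worker i
    remotecall_fetch(v -> Core.eval(Main, :(g = $v)), i, g)
end

# every process (master and workers) now has its own copy of g
@everywhere println("g on process $(myid()) = ", g)
```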
````diff
@@ -480,7 +482,7 @@ end
 Our advice, earned by practice, is:
 - to have a shared directory (shared home) with the code, and to share the location of packages
 - to place all the code for the workers in one file, let's call it `worker.jl` (the author includes the code for the master there as well)
-- to put at the beginning of `worker.jl` code activating the specified environment, as
+- to put at the beginning of `worker.jl` code activating the specified environment, as (or to specify the environment for all workers via the environment variable `export JULIA_PROJECT="$PWD"`)
 ```julia
 using Pkg
 Pkg.activate(@__DIR__)
````
````diff
@@ -550,7 +552,7 @@ end
 with benchmark
 ```
 julia> @btime juliaset_static(-0.79, 0.15, 1000);
-  16.206 ms (26 allocations: 978.42 KiB)
+  15.751 ms (27 allocations: 978.75 KiB)
 ```
 Although we have used four threads, and the communication overhead should be next to zero, the speedup is only ``2.4``. Why is that?
````

````diff
@@ -560,7 +562,7 @@ using LoggingProfiler
 function juliaset_static(x, y, n=1000)
     c = x + y*im
     img = Array{UInt8,2}(undef,n,n)
-    Threads.@threads for j in 1:n
+    Threads.@threads :dynamic for j in 1:n
         LoggingProfiler.@recordfun juliaset_column!(img, c, n, j)
     end
     return img
````
````diff
@@ -577,7 +579,9 @@ LoggingProfiler.export2luxor("profile.png")
 ![profile.png](profile.png)
 From the visualization of the profiler we can see that not all threads were working for the same amount of time. Threads 1 and 4 were working less than threads 2 and 3. The reason is that the static scheduler partitions the total number of columns (1000) into equal parts, where the number of parts equals the number of threads, and assigns each part to a single thread. In our case we have four parts, each of size 250. Since the execution time of computing each pixel is not the same, threads with many zero iterations finish considerably faster. This is an incarnation of one of the biggest problems in multi-threading / scheduling. A contemporary approach is to switch to dynamic scheduling, which divides the problem into smaller parts; when a thread finishes one part, it is assigned a new, not-yet-computed part.

-Dynamic scheduling is supported via the `Threads.@spawn` macro. The prototypical approach is the fork-join model, where one recursively partitions the problem and each thread waits for the others
+Since 1.5, one can specify the scheduler for the `Threads.@threads [schedule] for` construct to be either `:static` or `:dynamic`. The `:dynamic` option is compatible with the `partr` dynamic scheduler. Since 1.8, `:dynamic` is the default, but the range is divided into `nthreads()` parts, which is why we do not see an improvement.
+
+Dynamic scheduling is also supported via the `Threads.@spawn` macro. The prototypical invocation is the fork-join model, where one recursively partitions the problem and each thread waits for the others
 ```julia
 function juliaset_recspawn!(img, c, n, lo=1, hi=n, ntasks=128)
     if hi - lo > n/ntasks-1
````
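The static chunking described above can be observed directly by recording which thread executes which iteration. A minimal sketch (the `owner` array and the toy loop are ours, not the lecture's):

```julia
using Base.Threads

# Record which thread executes each iteration under :static scheduling.
n = 8
owner = zeros(Int, n)
@threads :static for i in 1:n
    owner[i] = threadid()
end

# :static splits 1:n into nthreads() contiguous, equal-sized chunks,
# so `owner` comes out sorted, e.g. [1, 1, 2, 2, 3, 3, 4, 4] on 4 threads.
println(owner)
```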
````diff
@@ -605,6 +609,8 @@ end
 julia> @btime juliaset_forkjoin(-0.79, 0.15);
   10.326 ms (142 allocations: 986.83 KiB)
 ```
+This is so far our fastest construction, with a speedup of `38.932 / 10.326 = 3.77×`.
+
 Unfortunately, `LoggingProfiler` does not handle task migration at the moment, which means that we cannot visualize the results. Due to the task-switching overhead, increasing the granularity might not pay off.
 ```julia
 4 tasks: 16.262 ms (21 allocations: 978.05 KiB)
````
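The diff only shows the first lines of `juliaset_recspawn!`, so here is a complete fork-join sketch in the same spirit on a toy reduction; the name `recsum`, the workload `abs2`, and the `basesize` default are our assumptions, not the lecture's code:

```julia
using Base.Threads

# Fork-join: recursively split the range, spawn one half as a task,
# recurse into the other half, then wait for (join) the spawned task.
function recsum(f, lo, hi, basesize = 1024)
    hi - lo + 1 <= basesize && return sum(f, lo:hi)
    mid = (lo + hi) >>> 1
    task = Threads.@spawn recsum(f, lo, mid, basesize)  # fork
    right = recsum(f, mid + 1, hi, basesize)
    return fetch(task) + right                          # join
end

recsum(abs2, 1, 10_000)  # same result as sum(abs2, 1:10_000)
```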
````diff
@@ -628,10 +634,10 @@ function juliaset_folds(x, y, n=1000, basesize = 2)
     return img
 end

-julia> @btime juliaset(-0.79, 0.15, 1000, juliaset_folds);
-  10.575 ms (52 allocations: 980.12 KiB)
+julia> @btime juliaset_folds(-0.79, 0.15, 1000);
+  10.253 ms (3960 allocations: 1.24 MiB)
 ```
-where `basesize` is the size of the part, in this case 2 columns.
+where `basesize` is the size of the smallest part allocated to a single thread, in this case 2 columns.
 ```julia
 julia> @btime juliaset_folds(-0.79, 0.15, 1000);
   10.575 ms (52 allocations: 980.12 KiB)
````
````diff
@@ -650,6 +656,29 @@ julia> @btime juliaset_folds(-0.79, 0.15, 1000);
   10.421 ms (3582 allocations: 1.20 MiB)
 ```

+We can identify the best smallest work size `basesize` and measure its influence on the runtime
+```julia
+map(2 .^ (0:7)) do bs
+    t = @belapsed juliaset_folds(-0.79, 0.15, 1000, $(bs));
+    (; basesize = bs, time = t)
+end |> DataFrame
+```
+
+```julia
+ Row │ basesize  time
+     │ Int64     Float64
+─────┼─────────────────────
+   1 │        1  0.0106803
+   2 │        2  0.010267
+   3 │        4  0.0103081
+   4 │        8  0.0101652
+   5 │       16  0.0100204
+   6 │       32  0.0100097
+   7 │       64  0.0103293
+   8 │      128  0.0105411
+```
+We observe that the minimum is at `basesize = 32`, for which we get a `3.8932×` speedup.

 ## Garbage collector is single-threaded
 Keep in mind that while threads are very easy and convenient to use, there are use-cases where you might be better off with processes, even though there will be some communication overhead. One such case occurs when you need to allocate and free a lot of memory. This is because Julia's garbage collector is single-threaded. Imagine the task of making a histogram of bytes in a directory.
 For a fair comparison, we will use `Transducers`, since they offer both thread-based and process-based parallelism
````
````diff
@@ -688,9 +717,9 @@ using Transducers
 end
 files = filter(isfile, readdir("/Users/tomas.pevny/Downloads/", join = true))
 @elapsed foldxd(mergewith(+), files |> Map(histfile))
-36.224765744
+86.44577969
 @elapsed foldxt(mergewith(+), files |> Map(histfile))
-23.257072067
+105.32969331
 ```
 is much better.
````
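The threads-versus-processes effect can be reproduced without the lecture's `Downloads` directory or `Transducers`. Below is a stdlib-only toy sketch; the `alloc_heavy` function, the worker count, and the task counts are our assumptions, and no particular timing outcome is guaranteed on a given machine:

```julia
using Distributed
addprocs(2)

# An allocation-heavy job: each call creates many short-lived arrays,
# putting pressure on the (single-threaded) garbage collector.
@everywhere alloc_heavy(n) = sum(sum(ones(10_000)) for _ in 1:n)

# thread-based: all tasks share one process and hence one GC
t_threads = @elapsed begin
    tasks = [Threads.@spawn alloc_heavy(500) for _ in 1:4]
    foreach(fetch, tasks)
end

# process-based: each worker runs its own independent GC
t_procs = @elapsed pmap(_ -> alloc_heavy(500), 1:4)

println("threads: $(t_threads)s  processes: $(t_procs)s")
```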

docs/src/lecture_10/profile.png

831 Bytes