docs/src/lecture_10/lecture.md
end
@everywhere show_secret()
for i in workers()
    remotecall_fetch(g -> eval(:(g = $(g))), i, g)
end
@everywhere show_secret()
```
This is implemented, together with other variants, in `ParallelDataTransfer.jl`, but in general this construct should be avoided.
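For reference, the pattern above amounts to a helper like the following. This is a hedged sketch: `sendto_sketch` is a hypothetical name, and the `ParallelDataTransfer.jl` implementation is more careful.

```julia
using Distributed

# Hypothetical helper: define a global `name = value` on each given worker
# by building the assignment expression and evaluating it remotely.
# (Works for plain values; Symbols/Exprs would need explicit quoting.)
function sendto_sketch(pids, name::Symbol, value)
    for p in pids
        remotecall_fetch(Core.eval, p, Main, Expr(:(=), name, value))
    end
end
```

With this helper, the loop above collapses to `sendto_sketch(workers(), :g, g)`.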
Our advice, earned by practice, is:
- have a shared directory (shared home) with the code, and share the location of packages
- place all the code for workers in one file, say `worker.jl` (the author of this puts the code for the master there as well)
- put code activating the specified environment at the beginning of `worker.jl` (or specify the environment for all workers in an environment variable, as `export JULIA_PROJECT="$PWD"`), for example
```julia
using Pkg
Pkg.activate(@__DIR__)
```
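Alternatively, workers can be launched so that they inherit the master's project environment directly. A minimal sketch, assuming local workers and an arbitrary worker count of 4:

```julia
using Distributed

# Start 4 local workers whose Julia processes activate the same project
# as the master, so package versions resolve identically everywhere.
addprocs(4; exeflags = "--project=$(Base.active_project())")

# Sanity check: a worker reports the same active project file as the master.
remotecall_fetch(Base.active_project, first(workers()))
```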
with benchmark
```
julia> @btime juliaset_static(-0.79, 0.15, 1000);
  15.751 ms (27 allocations: 978.75 KiB)
```
Although we have used four threads and the communication overhead should be next to zero, the speedup is only ``2.4``. Why is that?
```julia
using LoggingProfiler
function juliaset_static(x, y, n=1000)
    c = x + y*im
    img = Array{UInt8,2}(undef, n, n)
    Threads.@threads :dynamic for j in 1:n
        LoggingProfiler.@recordfun juliaset_column!(img, c, n, j)
    end
    img
end
```
From the visualization of the profiler we can see that not all threads were working at the same time: threads 1 and 4 were working less than threads 2 and 3. The reason is that the static scheduler partitions the total number of columns (1000) into equal parts, one per thread, and assigns each part to a single thread. In our case we get four parts, each of size 250. Since the execution time of computing each pixel is not the same, threads with many zero iterations finish considerably faster. This is an incarnation of one of the biggest problems in multi-threading / scheduling. A contemporary approach is to switch to dynamic scheduling, which divides the problem into smaller parts; when a thread finishes one part, it is assigned a new, not-yet-computed part.
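The static partitioning described above can be illustrated with a toy helper (a sketch for intuition; `static_chunks` is a hypothetical name, not an API of Julia's scheduler):

```julia
# Split 1:n into nt contiguous chunks, as a static scheduler would,
# distributing any remainder over the first chunks.
function static_chunks(n, nt)
    len, rem = divrem(n, nt)
    [(i - 1) * len + min(i - 1, rem) + 1 : i * len + min(i, rem) for i in 1:nt]
end

static_chunks(1000, 4)  # → [1:250, 251:500, 501:750, 751:1000]
```

Each thread gets one fixed 250-column block up front, regardless of how expensive those columns turn out to be.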
Since Julia 1.5, one can specify the scheduler for the `Threads.@threads [schedule] for` construct to be either `:static` or `:dynamic`. The `:dynamic` schedule is compatible with the `partr` dynamic scheduler. Since `1.8`, `:dynamic` is the default, but the range is still divided into `nthreads()` parts, which is why we do not see an improvement.

A dynamic scheduler is also supported through the `Threads.@spawn` macro. The prototypical approach to its invocation is the fork-join model, where one recursively partitions the problem and each task waits for the tasks it spawned:
```julia
function juliaset_recspawn!(img, c, n, lo=1, hi=n, ntasks=128)
    if hi - lo > n/ntasks - 1
```
```
julia> @btime juliaset_forkjoin(-0.79, 0.15);
  10.326 ms (142 allocations: 986.83 KiB)
```
This is so far our fastest construction, with a speedup of `38.932 / 10.326 = 3.77×`.
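For completeness, the fork-join recursion can be sketched in a self-contained form. This is a hedged reconstruction, not the lecture's exact code: `juliaset_column!` is given a toy escape-time definition here, and `ntasks` should not exceed `n`.

```julia
# Toy column kernel: escape-time iteration for column j of an n×n image.
function juliaset_column!(img, c, n, j)
    x = -2.0 + (j - 1) * 4.0 / (n - 1)
    for i in 1:n
        y = -2.0 + (i - 1) * 4.0 / (n - 1)
        z = x + y * im
        k = UInt8(0)
        while abs2(z) < 4 && k < 0xff
            z = z * z + c
            k += 0x01
        end
        img[i, j] = k
    end
end

# Fork-join: spawn a task for the lower half, recurse on the upper half
# in the current task, then wait for the spawned task to finish.
function juliaset_recspawn_sketch!(img, c, n, lo = 1, hi = n, ntasks = 128)
    if hi - lo > n / ntasks - 1
        mid = (lo + hi) >>> 1
        t = Threads.@spawn juliaset_recspawn_sketch!(img, c, n, lo, mid, ntasks)
        juliaset_recspawn_sketch!(img, c, n, mid + 1, hi, ntasks)
        wait(t)
    else
        foreach(j -> juliaset_column!(img, c, n, j), lo:hi)
    end
    img
end
```

The recursion bottoms out once a range holds at most roughly `n/ntasks` columns, so the scheduler has many small tasks to balance across threads.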
Unfortunately, `LoggingProfiler` does not handle task migration at the moment, which means we cannot visualize the results. Due to task-switching overhead, increasing the granularity might not pay off:
```julia
4 tasks: 16.262 ms (21 allocations: 978.05 KiB)
```
We can identify the best smallest work size `basesize` and measure its influence on the time
```julia
map(2 .^ (0:7)) do bs
    t = @belapsed juliaset_folds(-0.79, 0.15, 1000, $(bs));
    (; basesize = bs, time = t)
end |> DataFrame
```

```julia
 Row │ basesize  time
     │ Int64     Float64
─────┼─────────────────────
   1 │        1  0.0106803
   2 │        2  0.010267
   3 │        4  0.0103081
   4 │        8  0.0101652
   5 │       16  0.0100204
   6 │       32  0.0100097
   7 │       64  0.0103293
   8 │      128  0.0105411
```
We observe that the minimum is at `basesize = 32`, for which we obtained a `3.8932×` speedup.
## Garbage collector is single-threaded

Keep in mind that while threads are very easy and convenient to use, there are use-cases where you might be better off with processes, even though there will be some communication overhead. One such case is when you need to allocate and free a lot of memory, because Julia's garbage collector is single-threaded. Imagine the task of making a histogram of bytes in a directory.
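The task can be sketched as follows (a serial baseline with a hypothetical helper name, not the lecture's implementation):

```julia
# Histogram of byte values 0x00–0xff over all files under `dir`.
# Each `read` allocates a fresh buffer per file — exactly the kind of
# allocation churn that stresses the single-threaded garbage collector.
function bytehist(dir)
    h = zeros(Int, 256)
    for (root, _, files) in walkdir(dir)
        for f in files
            for b in read(joinpath(root, f))
                h[b + 1] += 1   # b is a UInt8; shift to 1-based index
            end
        end
    end
    h
end
```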
For a fair comparison, we will use `Transducers.jl`, since it offers both thread-based and process-based parallelism.