Commit 2bd5328

add mpi and distributed

1 parent 1a4c291 commit 2bd5328

30 files changed: +5809 −0 lines changed

exercises/mpi/README.md

Lines changed: 110 additions & 0 deletions
# Diffusion 2D - MPI

In this part, we want to use MPI (distributed parallelism) to parallelize our
Diffusion 2D example.

The starting point is (once again) the serial loop version
[`diffusion_2d_loop.jl`](./../diffusion_2d/diffusion_2d_loop.jl). The file
[`diffusion_2d_mpi.jl`](./diffusion_2d_mpi.jl) in this folder is a modified
copy of this variant. While the computational kernel `diffusion_step!` is
essentially untouched, we have included MPI bits at the beginning of the
`run_diffusion` function and introduced the key function `update_halo!`, which
is supposed to take care of the data exchange between MPI ranks. However, as of
now, the function isn't communicating anything, and it will be (one of) your
tasks to fix that 😉.

## Task 1 - Running the MPI code

Although incomplete from a semantic point of view, the code in
`diffusion_2d_mpi.jl` is perfectly runnable as is. It won't compute the right
thing, but it runs 😉. So **let's run it**. But how?

The first thing to realize is that, on Perlmutter, **you can't run MPI on a
login node**. You have two options to work on a compute node:

1) **Interactive session**: You can try to get an interactive session on a
compute node by running `sh get_compute_node_interactive.sh`. Unfortunately,
we don't have a node for everyone, so you might not get one (sorry!). **If you
do get one**, you can use `mpiexecjl --project -n 4 julia
diffusion_2d_mpi.jl` to run the code. Alternatively, you can run `sh
job_mpi_singlenode.sh`.

2) **Compute job**: You can always submit a job that runs the code: `sbatch
job_mpi_singlenode.sh`. The output will land in `slurm_mpi_singlenode.out`.
Check out the [Perlmutter cheatsheet](../../help/perlmutter_cheatsheet.md) to
learn more about jobs.

Irrespective of which option you choose, **go ahead and run the code** (with 4
MPI ranks).

To see that the code is currently not computing the right thing, run
`julia --project visualize_mpi.jl` to combine the results of the different MPI
ranks (`*.jld2` files) into a visualization (`visualization.png`). Inspect the
visualization and notice the undesired dark lines.
## Task 2 - Halo exchange

Take a look at the general MPI setup (the beginning of `run_diffusion`) and
the parts of the `update_halo!` function that are already there, and try to
understand them.

Afterwards, implement the necessary MPI communication. To that end, find the
"TODO" block in `update_halo!` and follow the instructions. Note that we want
to use **non-blocking** communication, i.e. you should use the functions
`MPI.Irecv!` and `MPI.Isend`.
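If you haven't written non-blocking MPI code before, here is a minimal,
self-contained sketch of the receive-send-wait pattern (a toy for exactly two
ranks, not the exercise solution; the file name is up to you):

```julia
# Toy non-blocking exchange between two MPI ranks.
# Run with: mpiexecjl --project -n 2 julia <this_file>.jl
using MPI
MPI.Init()
comm  = MPI.COMM_WORLD
me    = MPI.Comm_rank(comm)
other = 1 - me                   # the only other rank
send_buf = fill(Float64(me), 1)  # data we own
recv_buf = zeros(1)              # "halo" we receive

reqs = MPI.MultiRequest(2)
MPI.Irecv!(recv_buf, comm, reqs[1]; source=other)  # post the receive first
MPI.Isend(send_buf, comm, reqs[2]; dest=other)     # then the matching send
MPI.Waitall(reqs)                                  # wait for both to complete

println("rank $me received $(recv_buf[1]) from rank $other")
MPI.Finalize()
```

The key point is that `MPI.Irecv!` and `MPI.Isend` merely *start* the
communication and return immediately; only after `MPI.Waitall` is it safe to
read the receive buffer (or reuse the send buffer). `update_halo!` already has
this receive-send-wait structure laid out for you.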
Check that your code is working by comparing the `visualization.png` that you
get to this (basic "eye test"):

<img src="./solution/visualization_desired.png" width=500px>

> [!NOTE]
> If the halo exchange fails to send the boundary data, then you will see
> defects on the MPI rank boundaries, for example:
> <img src="./solution/visualization_before.png" width=500px>

## Task 3 - Benchmark

### Part A

Our goal is to perform a rough and basic scaling analysis with 4, 8, 12, and
16 MPI ranks distributed across multiple nodes. Specifically, we want to run 4
MPI ranks per node and increase the number of nodes to get up to 16 ranks in
total.

The file `job_mpi_multinode.sh` is a job script that currently requests a
single node (see the line `#SBATCH --nodes=1`) running 4 MPI ranks (see the
line `#SBATCH --ntasks-per-node=4`), and then runs our Julia MPI code with
`do_save=false` for simplicity and `ns=6144`.

Submit this file to SLURM via `sbatch job_mpi_multinode.sh`. Once the job has
run, the output will land in `slurm_mpi_multinode.out`. Write the output down
somewhere (copy & paste), change the number of nodes to 2 (= 8 MPI ranks in
total) and rerun the experiment. Repeat the same thing, this time requesting 3
nodes (= 12 MPI ranks in total) and then requesting 4 nodes (= 16 MPI ranks in
total).

### Part B

Inspect the results that you've obtained and compare them.

**Questions**
* What do you observe?
* Is this what you expected?

Note that in setting up our MPI ranks, we split our global grid into local
grids. In the process, the meaning of the input parameter `ns` changed
compared to the previous codes (serial & multithreading). It now determines
the resolution of the **local grid** - the one that each MPI rank is holding -
rather than the resolution of the global grid. Since we keep `ns` fixed (at
6144 in `job_mpi_multinode.sh`), we thus increase the problem size (the total
grid resolution) when we increase the number of MPI ranks. This is known as a
"weak scaling" analysis.

**Questions**

* Given the comment above, what does "ideal parallel scaling" mean in the
context of a "weak scaling" analysis?
* What do the observed results tell you?
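
Once you've thought about these questions, you can also put a number on your
measurements. Since the per-rank workload is fixed, a common metric (a
standard definition, not something provided by the exercise files) is the
weak-scaling efficiency:

```julia
# Weak-scaling efficiency (hypothetical helper): t1 is the measured runtime on
# 1 node (4 ranks), tN the runtime on N nodes. Ideal weak scaling keeps the
# runtime constant, giving 1.0; smaller values indicate parallel overhead
# (e.g. halo communication across nodes) growing with N.
weak_scaling_efficiency(t1, tN) = t1 / tN
```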

exercises/mpi/diffusion_2d_mpi.jl

Lines changed: 124 additions & 0 deletions
# 2D linear diffusion solver - MPI
using Printf
using JLD2
using MPI
include(joinpath(@__DIR__, "shared.jl"))

# convenience macros simply to avoid writing nested finite-difference expressions
macro qx(ix, iy) esc(:(-D * (C[$ix+1, $iy] - C[$ix, $iy]) / dx)) end
macro qy(ix, iy) esc(:(-D * (C[$ix, $iy+1] - C[$ix, $iy]) / dy)) end
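# (e.g. @qx(ix, iy) expands to the x-direction flux -D * (C[ix+1, iy] - C[ix, iy]) / dx)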

function diffusion_step!(params, C2, C)
    (; dx, dy, dt, D) = params
    for iy in 1:size(C, 2)-2
        for ix in 1:size(C, 1)-2
            @inbounds C2[ix+1, iy+1] = C[ix+1, iy+1] - dt * ((@qx(ix+1, iy+1) - @qx(ix, iy+1)) / dx +
                                                             (@qy(ix+1, iy+1) - @qy(ix+1, iy)) / dy)
        end
    end
    return nothing
end

# MPI functions
@views function update_halo!(A, bufs, neighbors, comm)
    #
    # !!! TODO
    #
    # Complete the halo exchange implementation. Specifically, use non-blocking
    # MPI communication (Irecv! and Isend) at the positions marked by "TODO..." below.
    #
    # Help:
    #     left neighbor:  neighbors.x[1]
    #     right neighbor: neighbors.x[2]
    #     up neighbor:    neighbors.y[1]
    #     down neighbor:  neighbors.y[2]
    #
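    # (For reference, the MPI.jl keyword API has roughly this shape:
    #      MPI.Irecv!(recv_buffer, comm, reqs[i]; source=rank)
    #      MPI.Isend(send_buffer, comm, reqs[i]; dest=rank)
    #  where `reqs` is the MPI.MultiRequest created below.)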

    # dim-1 (x)
    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(bufs.send_1_1, A[2 , :])
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(bufs.send_1_2, A[end-1, :])

    reqs = MPI.MultiRequest(4)
    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... receive from left neighbor into bufs.recv_1_1
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... receive from right neighbor into bufs.recv_1_2

    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... send bufs.send_1_1 to left neighbor
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... send bufs.send_1_2 to right neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(A[1 , :], bufs.recv_1_1)
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(A[end, :], bufs.recv_1_2)

    # dim-2 (y)
    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(bufs.send_2_1, A[:, 2 ])
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(bufs.send_2_2, A[:, end-1])

    reqs = MPI.MultiRequest(4)
    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... receive from up neighbor into bufs.recv_2_1
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... receive from down neighbor into bufs.recv_2_2

    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... send bufs.send_2_1 to up neighbor
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... send bufs.send_2_2 to down neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(A[:, 1 ], bufs.recv_2_1)
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(A[:, end], bufs.recv_2_2)
    return nothing
end

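# one send and one receive staging buffer per side of the local grid; the
# boundary rows/columns are copied into these contiguous arrays for the MPI
# transfers, and received halos are copied back out afterwards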
function init_bufs(A)
    return (; send_1_1=zeros(size(A, 2)), send_1_2=zeros(size(A, 2)),
              send_2_1=zeros(size(A, 1)), send_2_2=zeros(size(A, 1)),
              recv_1_1=zeros(size(A, 2)), recv_1_2=zeros(size(A, 2)),
              recv_2_1=zeros(size(A, 1)), recv_2_2=zeros(size(A, 1)))
end

function run_diffusion(; ns=64, nt=100, do_save=false)
    MPI.Init()
    comm = MPI.COMM_WORLD
    nprocs = MPI.Comm_size(comm)
    dims = MPI.Dims_create(nprocs, (0, 0)) |> Tuple
    comm_cart = MPI.Cart_create(comm, dims)
    me = MPI.Comm_rank(comm_cart)
    coords = MPI.Cart_coords(comm_cart) |> Tuple
    neighbors = (; x=MPI.Cart_shift(comm_cart, 0, 1), y=MPI.Cart_shift(comm_cart, 1, 1))
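    # (Dims_create picks a balanced process grid, e.g. (2, 2) for 4 ranks;
    #  Cart_create builds a Cartesian communicator on it; Cart_shift returns
    #  the neighbor ranks along each dimension, MPI.PROC_NULL at grid edges.)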
    (me == 0) && println("nprocs = $(nprocs), dims = $dims")

    params = init_params_mpi(; dims, coords, ns, nt, do_save)
    C, C2 = init_arrays_mpi(params)
    bufs = init_bufs(C)
    t_tic = 0.0
    # time loop
    for it in 1:nt
        # time after warmup (ignore first 10 iterations)
        (it == 11) && (t_tic = Base.time())
        # diffusion
        diffusion_step!(params, C2, C)
        update_halo!(C2, bufs, neighbors, comm_cart)
        C, C2 = C2, C # pointer swap
    end
    t_toc = (Base.time() - t_tic)
    # "master" prints performance
    (me == 0) && print_perf(params, t_toc)
    # save to (maybe) visualize later
    if do_save
        jldsave(joinpath(@__DIR__, "out_$(me).jld2"); C = Array(C[2:end-1, 2:end-1]), lxy = (; lx=params.L, ly=params.L))
    end
    MPI.Finalize()
    return nothing
end

# Running things...

# enable save to disk by default
(!@isdefined do_save) && (do_save = true)
# enable execution by default
(!@isdefined do_run) && (do_run = true)

if do_run
    if !isempty(ARGS)
        run_diffusion(; ns=parse(Int, ARGS[1]), do_save)
    else
        run_diffusion(; ns=256, do_save)
    end
end
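
# (Usage sketch: e.g. `mpiexecjl --project -n 4 julia diffusion_2d_mpi.jl 1024`
#  runs with ns=1024, while `julia -e 'do_save=false; include("diffusion_2d_mpi.jl")'`
#  overrides the save-to-disk default before including this file.)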

exercises/mpi/get_compute_node_interactive.sh

Lines changed: 1 addition & 0 deletions

salloc --nodes 1 --cpus-per-task=1 --qos interactive --time 00:45:00 --constraint cpu --ntasks-per-node=4 --account=ntrain1

exercises/mpi/job_mpi_multinode.sh

Lines changed: 22 additions & 0 deletions
#!/bin/bash
#SBATCH -A ntrain1
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH --output=slurm_mpi_multinode.out
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# Load the latest Julia module
ml load julia

# This will load the activate.sh in the root path of this repository.
# IMPORTANT: for this relative path to work, you need to be in this
# directory when running `sbatch`.
source ../../activate.sh

# Run the Julia code -- we're using `srun` to launch Julia. This is necessary
# to configure MPI. If you try to use `MPI.Init()` outside of an srun, the
# program will crash. Note also that you can't run an srun _inside_ of
# another srun.
srun julia -e 'do_save=false; include("diffusion_2d_mpi.jl");' 6144
exercises/mpi/job_mpi_singlenode.sh

Lines changed: 22 additions & 0 deletions

#!/bin/bash
#SBATCH -A ntrain1
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH --output=slurm_mpi_singlenode.out
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks=4

# Load the latest Julia module
ml load julia

# This will load the activate.sh in the root path of this repository.
# IMPORTANT: for this relative path to work, you need to be in this
# directory when running `sbatch`.
source ../../activate.sh

# Run the Julia code -- we're using `srun` to launch Julia. This is necessary
# to configure MPI. If you try to use `MPI.Init()` outside of an srun, the
# program will crash. Note also that you can't run an srun _inside_ of
# another srun.
srun julia diffusion_2d_mpi.jl
