Commit 2bd5328

add mpi and distributed

1 parent 1a4c291 commit 2bd5328

30 files changed: +5809 −0 lines changed

exercises/mpi/README.md

Lines changed: 110 additions & 0 deletions
# Diffusion 2D - MPI

In this part, we want to use MPI (distributed parallelism) to parallelize our
Diffusion 2D example.

The starting point is (once again) the serial loop version
[`diffusion_2d_loop.jl`](./../diffusion_2d/diffusion_2d_loop.jl). The file
[`diffusion_2d_mpi.jl`](./diffusion_2d_mpi.jl) in this folder is a modified
copy of this variant. While the computational kernel `diffusion_step!` is
essentially untouched, we have included MPI bits at the beginning of the
`run_diffusion` function and introduced the key function `update_halo!`, which
is supposed to take care of the data exchange between MPI ranks. However, as of
now, the function isn't communicating anything, and it will be (one of) your
tasks to fix that 😉.

## Task 1 - Running the MPI code

Although incomplete from a semantic point of view, the code in
`diffusion_2d_mpi.jl` is perfectly runnable as is. It won't compute the right
thing, but it runs 😉. So **let's run it**. But how?

The first thing to realize is that, on Perlmutter, **you can't run MPI on a
login node**. You have two options to work on a compute node:

1) **Interactive session**: You can try to get an interactive session on a
compute node by running `sh get_compute_node_interactive.sh`. Unfortunately,
we don't have a node for everyone, so you might not get one (sorry!). **If you
do get one**, you can use `mpiexecjl --project -n 4 julia
diffusion_2d_mpi.jl` to run the code. Alternatively, you can run `sh
job_mpi_singlenode.sh`.

2) **Compute job**: You can always submit a job that runs the code: `sbatch
job_mpi_singlenode.sh`. The output will land in `slurm_mpi_singlenode.out`.
Check out the [Perlmutter cheatsheet](../../help/perlmutter_cheatsheet.md) to
learn more about jobs.

Irrespective of which option you choose, **go ahead and run the code** (with 4
MPI ranks).

To see that the code is currently not computing the right thing, run
`julia --project visualize_mpi.jl` to combine the results of the different MPI
ranks (`*.jld2` files) into a visualization (`visualization.png`). Inspect the
visualization and notice the undesired dark lines.
## Task 2 - Halo exchange

Take a look at the general MPI setup (the beginning of `run_diffusion`) and
the parts of the `update_halo!` function that are already there, and try to
understand them.

Afterwards, implement the necessary MPI communication. To that end, find the
"TODO" block in `update_halo!` and follow the instructions. Note that we want
to use **non-blocking** communication, i.e. you should use the functions
`MPI.Irecv!` and `MPI.Isend`.
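If you haven't written non-blocking MPI code before, here is a minimal,
self-contained sketch of the receive-send-wait pattern (a toy for exactly two
ranks, not the exercise solution; the file name is up to you):

```julia
# Toy non-blocking exchange between two MPI ranks.
# Run with: mpiexecjl --project -n 2 julia <this_file>.jl
using MPI
MPI.Init()
comm  = MPI.COMM_WORLD
me    = MPI.Comm_rank(comm)
other = 1 - me                   # the only other rank
send_buf = fill(Float64(me), 1)  # data we own
recv_buf = zeros(1)              # "halo" we receive

reqs = MPI.MultiRequest(2)
MPI.Irecv!(recv_buf, comm, reqs[1]; source=other)  # post the receive first
MPI.Isend(send_buf, comm, reqs[2]; dest=other)     # then the matching send
MPI.Waitall(reqs)                                  # wait for both to complete

println("rank $me received $(recv_buf[1]) from rank $other")
MPI.Finalize()
```

The key point is that `MPI.Irecv!` and `MPI.Isend` merely *start* the
communication and return immediately; only after `MPI.Waitall` is it safe to
read the receive buffer (or reuse the send buffer). `update_halo!` already has
this receive-send-wait structure laid out for you.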
Check that your code is working by comparing the `visualization.png` that you
get to this (basic "eye test"):

<img src="./solution/visualization_desired.png" width=500px>

> [!NOTE]
> If the halo exchange fails to send the boundary data, then you will see
> defects on the MPI rank boundaries, for example:
> <img src="./solution/visualization_before.png" width=500px>

## Task 3 - Benchmark

### Part A

Our goal is to perform a rough and basic scaling analysis with 4, 8, 12, and
16 MPI ranks distributed across multiple nodes. Specifically, we want to run 4
MPI ranks per node and increase the number of nodes to get up to 16 ranks in
total.

The file `job_mpi_multinode.sh` is a job script that currently requests a
single node (see the line `#SBATCH --nodes=1`) running 4 MPI ranks (see the
line `#SBATCH --ntasks-per-node=4`), and then runs our Julia MPI code with
`do_save=false` for simplicity and `ns=6144`.

Submit this file to SLURM via `sbatch job_mpi_multinode.sh`. Once the job has
run, the output will land in `slurm_mpi_multinode.out`. Write the output down
somewhere (copy & paste), change the number of nodes to 2 (= 8 MPI ranks in
total) and rerun the experiment. Repeat the same thing, this time requesting 3
nodes (= 12 MPI ranks in total) and then requesting 4 nodes (= 16 MPI ranks in
total).

### Part B

Inspect the results that you've obtained and compare them.

**Questions**
* What do you observe?
* Is this what you expected?

Note that in setting up our MPI ranks, we split our global grid into local
grids. In the process, the meaning of the input parameter `ns` changed
compared to the previous codes (serial & multithreading). It now determines
the resolution of the **local grid** - the one that each MPI rank is holding -
rather than the resolution of the global grid. Since we keep `ns` fixed (at
6144 in `job_mpi_multinode.sh`), we thus increase the problem size (the total
grid resolution) when we increase the number of MPI ranks. This is known as a
"weak scaling" analysis.

**Questions**

* Given the comment above, what does "ideal parallel scaling" mean in the
context of a "weak scaling" analysis?
* What do the observed results tell you?
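
Once you've thought about these questions, you can also put a number on your
measurements. Since the per-rank workload is fixed, a common metric (a
standard definition, not something provided by the exercise files) is the
weak-scaling efficiency:

```julia
# Weak-scaling efficiency (hypothetical helper): t1 is the measured runtime on
# 1 node (4 ranks), tN the runtime on N nodes. Ideal weak scaling keeps the
# runtime constant, giving 1.0; smaller values indicate parallel overhead
# (e.g. halo communication across nodes) growing with N.
weak_scaling_efficiency(t1, tN) = t1 / tN
```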

exercises/mpi/diffusion_2d_mpi.jl

Lines changed: 124 additions & 0 deletions
# 2D linear diffusion solver - MPI
using Printf
using JLD2
using MPI
include(joinpath(@__DIR__, "shared.jl"))

# convenience macros simply to avoid writing nested finite-difference expressions
macro qx(ix, iy) esc(:(-D * (C[$ix+1, $iy] - C[$ix, $iy]) / dx)) end
macro qy(ix, iy) esc(:(-D * (C[$ix, $iy+1] - C[$ix, $iy]) / dy)) end
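# (e.g. @qx(ix, iy) expands to the x-direction flux -D * (C[ix+1, iy] - C[ix, iy]) / dx)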

function diffusion_step!(params, C2, C)
    (; dx, dy, dt, D) = params
    for iy in 1:size(C, 2)-2
        for ix in 1:size(C, 1)-2
            @inbounds C2[ix+1, iy+1] = C[ix+1, iy+1] - dt * ((@qx(ix+1, iy+1) - @qx(ix, iy+1)) / dx +
                                                             (@qy(ix+1, iy+1) - @qy(ix+1, iy)) / dy)
        end
    end
    return nothing
end

# MPI functions
@views function update_halo!(A, bufs, neighbors, comm)
    #
    # !!! TODO
    #
    # Complete the halo exchange implementation. Specifically, use non-blocking
    # MPI communication (Irecv! and Isend) at the positions marked by "TODO..." below.
    #
    # Help:
    #     left neighbor:  neighbors.x[1]
    #     right neighbor: neighbors.x[2]
    #     up neighbor:    neighbors.y[1]
    #     down neighbor:  neighbors.y[2]
    #
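    # (For reference, the MPI.jl keyword API has roughly this shape:
    #      MPI.Irecv!(recv_buffer, comm, reqs[i]; source=rank)
    #      MPI.Isend(send_buffer, comm, reqs[i]; dest=rank)
    #  where `reqs` is the MPI.MultiRequest created below.)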

    # dim-1 (x)
    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(bufs.send_1_1, A[2 , :])
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(bufs.send_1_2, A[end-1, :])

    reqs = MPI.MultiRequest(4)
    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... receive from left neighbor into bufs.recv_1_1
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... receive from right neighbor into bufs.recv_1_2

    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... send bufs.send_1_1 to left neighbor
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... send bufs.send_1_2 to right neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(A[1 , :], bufs.recv_1_1)
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(A[end, :], bufs.recv_1_2)

    # dim-2 (y)
    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(bufs.send_2_1, A[:, 2 ])
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(bufs.send_2_2, A[:, end-1])

    reqs = MPI.MultiRequest(4)
    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... receive from up neighbor into bufs.recv_2_1
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... receive from down neighbor into bufs.recv_2_2

    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... send bufs.send_2_1 to up neighbor
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... send bufs.send_2_2 to down neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(A[:, 1 ], bufs.recv_2_1)
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(A[:, end], bufs.recv_2_2)
    return nothing
end

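# one send and one receive staging buffer per side of the local grid; the
# boundary rows/columns are copied into these contiguous arrays for the MPI
# transfers, and received halos are copied back out afterwards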
function init_bufs(A)
    return (; send_1_1=zeros(size(A, 2)), send_1_2=zeros(size(A, 2)),
              send_2_1=zeros(size(A, 1)), send_2_2=zeros(size(A, 1)),
              recv_1_1=zeros(size(A, 2)), recv_1_2=zeros(size(A, 2)),
              recv_2_1=zeros(size(A, 1)), recv_2_2=zeros(size(A, 1)))
end

function run_diffusion(; ns=64, nt=100, do_save=false)
    MPI.Init()
    comm = MPI.COMM_WORLD
    nprocs = MPI.Comm_size(comm)
    dims = MPI.Dims_create(nprocs, (0, 0)) |> Tuple
    comm_cart = MPI.Cart_create(comm, dims)
    me = MPI.Comm_rank(comm_cart)
    coords = MPI.Cart_coords(comm_cart) |> Tuple
    neighbors = (; x=MPI.Cart_shift(comm_cart, 0, 1), y=MPI.Cart_shift(comm_cart, 1, 1))
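    # (Dims_create picks a balanced process grid, e.g. (2, 2) for 4 ranks;
    #  Cart_create builds a Cartesian communicator on it; Cart_shift returns
    #  the neighbor ranks along each dimension, MPI.PROC_NULL at grid edges.)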
    (me == 0) && println("nprocs = $(nprocs), dims = $dims")

    params = init_params_mpi(; dims, coords, ns, nt, do_save)
    C, C2 = init_arrays_mpi(params)
    bufs = init_bufs(C)
    t_tic = 0.0
    # time loop
    for it in 1:nt
        # time after warmup (ignore first 10 iterations)
        (it == 11) && (t_tic = Base.time())
        # diffusion
        diffusion_step!(params, C2, C)
        update_halo!(C2, bufs, neighbors, comm_cart)
        C, C2 = C2, C # pointer swap
    end
    t_toc = (Base.time() - t_tic)
    # "master" prints performance
    (me == 0) && print_perf(params, t_toc)
    # save to (maybe) visualize later
    if do_save
        jldsave(joinpath(@__DIR__, "out_$(me).jld2"); C = Array(C[2:end-1, 2:end-1]), lxy = (; lx=params.L, ly=params.L))
    end
    MPI.Finalize()
    return nothing
end

# Running things...

# enable save to disk by default
(!@isdefined do_save) && (do_save = true)
# enable execution by default
(!@isdefined do_run) && (do_run = true)

if do_run
    if !isempty(ARGS)
        run_diffusion(; ns=parse(Int, ARGS[1]), do_save)
    else
        run_diffusion(; ns=256, do_save)
    end
end
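
# (Usage sketch: e.g. `mpiexecjl --project -n 4 julia diffusion_2d_mpi.jl 1024`
#  runs with ns=1024, while `julia -e 'do_save=false; include("diffusion_2d_mpi.jl")'`
#  overrides the save-to-disk default before including this file.)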

exercises/mpi/get_compute_node_interactive.sh

Lines changed: 1 addition & 0 deletions

salloc --nodes 1 --cpus-per-task=1 --qos interactive --time 00:45:00 --constraint cpu --ntasks-per-node=4 --account=ntrain1

exercises/mpi/job_mpi_multinode.sh

Lines changed: 22 additions & 0 deletions
#!/bin/bash
#SBATCH -A ntrain1
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH --output=slurm_mpi_multinode.out
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# Load the latest Julia module
ml load julia

# This will load the activate.sh in the root path of this repository.
# IMPORTANT: for this relative path to work, you need to be in this
# directory when running `sbatch`.
source ../../activate.sh

# Run the Julia code -- we're using `srun` to launch Julia. This is necessary
# to configure MPI. If you try to use `MPI.Init()` outside of an srun, the
# program will crash. Note also that you can't run an srun _inside_ of
# another srun.
srun julia -e 'do_save=false; include("diffusion_2d_mpi.jl");' 6144
exercises/mpi/job_mpi_singlenode.sh

Lines changed: 22 additions & 0 deletions

#!/bin/bash
#SBATCH -A ntrain1
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH --output=slurm_mpi_singlenode.out
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks=4

# Load the latest Julia module
ml load julia

# This will load the activate.sh in the root path of this repository.
# IMPORTANT: for this relative path to work, you need to be in this
# directory when running `sbatch`.
source ../../activate.sh

# Run the Julia code -- we're using `srun` to launch Julia. This is necessary
# to configure MPI. If you try to use `MPI.Init()` outside of an srun, the
# program will crash. Note also that you can't run an srun _inside_ of
# another srun.
srun julia diffusion_2d_mpi.jl
