Description
I have an OSU test that is supposed to test point-to-point GPU communication. Essentially, it sets num_tasks=2 and num_tasks_per_node=2. The job script produced is:
#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_pt2pt_GPU_87fbf5ce"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu_h100
#SBATCH --export=None
#SBATCH --mem=737280M
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
...
mpirun -np 2 osu_bw -m 4194304 -x 5 -i 10 -c -d cuda D D
...
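As a workaround on the Slurm side, adding an explicit --nodes directive removes the ambiguity. A sketch of the relevant header lines (assuming the rest of the script stays unchanged):

```shell
#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_pt2pt_GPU_87fbf5ce"
#SBATCH --nodes=1              # pin the job to a single node
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2    # now effectively redundant, but harmless
```

With --nodes=1 present, Slurm can no longer satisfy the request by placing one task on each of two nodes.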
I saw strongly varying performance: either 25 GB/s or 120 GB/s. Based on our interconnect and the connectivity between GPUs, 25 GB/s matches our internode GPU-to-GPU performance, whereas 120 GB/s matches the intranode GPU-to-GPU performance. Checking the run report, I saw:
"outputdir": "/home/jenkins/EESSI/reframe_CI_runs/output/snellius/gpu_H100/default/EESSI_OSU_pt2pt_GPU_87fbf5ce",
...
"job_nodelist": [
"gcn114",
"gcn149"
],
I.e. this particular test was being scheduled onto two nodes. I was a bit surprised by this behavior, but after reading the SLURM documentation carefully, it becomes clear why:
--ntasks-per-node=
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option. This is related to --cpus-per-task=ncpus, but does not require knowledge of the actual number of cpus on each node. In some cases, it is more convenient to be able to request that no more than a specific number of tasks be invoked on each node. Examples of this include submitting a hybrid MPI/OpenMP app where only one MPI "task/rank" should be assigned to each node while allowing the OpenMP portion to utilize all of the parallelism present in the node, or submitting a single setup/cleanup/monitoring job to each node of a pre-existing allocation as one step in a larger job script.
Note in particular
If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.
I.e. they basically say: you should use it with --nodes, and if you use it with --ntasks instead, it's considered a maximum count of tasks per node. That gives SLURM the liberty of actually scheduling 2 nodes with 1 task per node - which is what is happening in my case. From a regression testing perspective, this is clearly undesirable, as it leads to unexpected changes in performance from one run to the next. Actually, I'd consider it a bug, because the ReFrame docs specify:
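To make the freedom this wording gives Slurm explicit: with --ntasks fixed and --ntasks-per-node only an upper bound, any node count between the tightest packing and one-task-per-node is a legal placement. A minimal sketch (allowed_node_counts is a hypothetical helper for illustration, not a Slurm API):

```python
import math

def allowed_node_counts(ntasks: int, max_tasks_per_node: int) -> list[int]:
    """Node counts Slurm may legally pick when --ntasks-per-node is
    treated only as a maximum (per the sbatch man page wording)."""
    min_nodes = math.ceil(ntasks / max_tasks_per_node)  # pack as tightly as allowed
    max_nodes = ntasks                                  # spread to one task per node
    return list(range(min_nodes, max_nodes + 1))

# With --ntasks=2 --ntasks-per-node=2, both 1 and 2 nodes are valid
# placements, which is why the job sometimes lands on gcn114 + gcn149.
print(allowed_node_counts(2, 2))  # [1, 2]
```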
num_tasks_per_node= None
Number of tasks per node required by this test.
Which suggests that this is exactly the number of tasks per node you'll get (and not a maximum, as it is for SLURM). But by specifying --ntasks and --ntasks-per-node (and not --nodes), ReFrame doesn't give the SLURM backend the right instructions to trigger the promised behavior.
Now, I know the use_nodes_option exists, and it does resolve the issue, but its default value is False. I'd consider it preferable to change the default to True, so that the behavior of num_tasks_per_node as documented in the ReFrame docs matches the behavior it triggers on the SLURM side.
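For anyone hitting the same issue in the meantime, the behavior can at least be opted into per partition. A sketch of what this might look like in a ReFrame settings file (system/partition names are placeholders, and I'm assuming the sched_options syntax of recent ReFrame versions):

```python
# Fragment of a ReFrame settings file (placeholder names, not a real config)
site_configuration = {
    'systems': [
        {
            'name': 'snellius',
            'hostnames': ['login.*', 'gcn.*'],
            'partitions': [
                {
                    'name': 'gpu_h100',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    # Ask ReFrame to also emit '#SBATCH --nodes=...', so that
                    # num_tasks_per_node is enforced rather than treated as a
                    # per-node maximum by Slurm
                    'sched_options': {'use_nodes_option': True},
                },
            ],
        },
    ],
}
```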