Using ReFrame to test a large number of nodes using --distribute and avoid massive queuing #3586

@MarcoMagl

ReFrame Version: 4.8.4
Python version: 3.12.3
Scheduler: slurm

What I want to achieve: I want to test every node in a node list. I have more than 20 tests per node and 500 nodes.

Problem:

  • Some of the nodes I want to test are already allocated to other users.
  • If I launch my tests with --distribute=idle, the nodes that are already allocated will not be tested.
  • If I launch my tests with --distribute=avail and I am unlucky, ReFrame may queue a lot of tests on the nodes allocated to other users. At some point, even if system.partitions.max_jobs is high enough, I reach the limit on the number of jobs I am allowed to submit (MaxSubmit in sacctmgr show association). I end up with a lot of queued jobs because ReFrame tried to launch the tests on the nodes that were allocated:
squeue -u $USER --state=pending | wc -l
568

but almost no tests are running:

squeue -u $USER --state=running | wc -l
2

In short: my test pipeline is completely stuck even though some nodes are idle! To put it another way: ReFrame submitted the jobs across the node list without prioritizing the nodes that were idle. Because some nodes were allocated and ReFrame queued jobs on them anyway, I ended up hitting the maximum number of jobs allowed for my Slurm account.
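To make the situation concrete, here is a minimal diagnostic sketch (not part of ReFrame). It assumes the node states come from the text output of `sinfo -N -h -o "%N %T"`. The helper names `split_nodes_by_state` and `jobs_over_limit`, the sample node names, and the MaxSubmit value of 600 are all hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict


def split_nodes_by_state(sinfo_output):
    """Group node names by state, given text from `sinfo -N -h -o "%N %T"`.

    Hypothetical helper: it only parses text and does not call Slurm itself.
    """
    nodes_by_state = defaultdict(set)
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, state = parts
        # sinfo may append flag characters (e.g. '*') to the state; drop them
        state = re.sub(r"[^a-z]+$", "", state.lower())
        # a set removes duplicates when a node appears in several partitions
        nodes_by_state[state].add(name)
    return {state: sorted(names) for state, names in nodes_by_state.items()}


def jobs_over_limit(num_nodes, tests_per_node, max_submit):
    """Return how many jobs of a full run would exceed the submit limit."""
    total_jobs = num_nodes * tests_per_node
    return max(0, total_jobs - max_submit)


if __name__ == "__main__":
    # Made-up sample output, not taken from a real cluster
    sample = "nid001 idle\nnid002 allocated\nnid003 idle*\nnid002 allocated\n"
    for state, names in split_nodes_by_state(sample).items():
        print(state, names)
    # 500 nodes and 20 tests per node, against an assumed MaxSubmit of 600
    print(jobs_over_limit(500, 20, 600))
```

With one job per test per node, a full run on 500 nodes with at least 20 tests each means at least 10,000 submissions, so any submit limit well below that is reached long before every idle node has received its jobs.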

I did not find a similar problem among the existing issues.
Does anyone have an idea how to overcome this? That would be really helpful!
