Description
When a subprocess managed by taskgraph is killed by the operating system, the pool automatically spawns a replacement process, but the events that were allocated to the killed process are not recreated. Because of this, the graph deadlocks waiting on events that will never be triggered.
To reproduce:
- Create a new graph in multiprocessed mode (`n_workers >= 1`)
- Execute a task
- Kill that task before it completes
- Observe graph hanging
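
The same failure mode can be sketched without taskgraph, assuming its pool behaves like the standard library's `multiprocessing.Pool` (I have not confirmed this against taskgraph's internals). The function name `reproduce_lost_result` is hypothetical, and the sketch is POSIX-only because it relies on `SIGKILL` and the `fork` start method. A timeout stands in for the indefinite hang.

```python
import multiprocessing
import os
import signal


def reproduce_lost_result(timeout=2.0):
    """Return (result_was_lost, worker_was_replaced) for a one-worker pool."""
    # Assumption: "fork" is available (POSIX); it keeps the sketch
    # independent of how this module is launched.
    context = multiprocessing.get_context("fork")
    with context.Pool(processes=1) as pool:
        # With a single worker, every task runs in the same process.
        worker_pid = pool.apply_async(os.getpid).get(timeout=timeout)

        # The worker runs os.kill on its own pid, which simulates the
        # operating system killing it in the middle of a task.
        doomed = pool.apply_async(os.kill, (worker_pid, signal.SIGKILL))
        try:
            doomed.get(timeout=timeout)
            result_was_lost = False
        except multiprocessing.TimeoutError:
            # The pool never marks the dead worker's task as finished.
            # Without the timeout, this wait would block forever.
            result_was_lost = True

        # The pool has started a replacement worker in the meantime.
        replacement_pid = pool.apply_async(os.getpid).get(timeout=timeout)
    return result_was_lost, replacement_pid != worker_pid


if __name__ == "__main__":
    print(reproduce_lost_result())
```

The pool keeps accepting work after the kill, which is why nothing in the caller notices that the first result will never arrive.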
A practical way to trigger this is to run in a memory-constrained environment such as Sherlock: make sure at least one task uses more memory than was requested for the SLURM job.
Ideally the appropriate events would be recreated so the graph could continue to execute, but I think it would be better to simply detect that the process has been terminated and then terminate the graph.
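
A minimal sketch of the detection idea, assuming the waiter has a handle on both the completion event and the worker process. This is not how taskgraph is implemented; `WorkerDiedError`, `wait_for_worker` and the demo helpers are hypothetical names, and the sketch is POSIX-only. Instead of blocking on the event indefinitely, the waiter polls it with a timeout and checks whether the worker is still alive.

```python
import multiprocessing
import os
import signal


class WorkerDiedError(RuntimeError):
    """Hypothetical error: a worker exited before signalling completion."""

    def __init__(self, pid, exit_code):
        super().__init__(f"worker {pid} exited with code {exit_code}")
        self.exit_code = exit_code


def wait_for_worker(done_event, process, poll_interval=0.1):
    """Block until done_event is set; raise if the process dies first."""
    while not done_event.wait(timeout=poll_interval):
        if not process.is_alive():
            # The worker may have set the event just before exiting.
            if done_event.is_set():
                return
            raise WorkerDiedError(process.pid, process.exitcode)


def _killed_task(done_event):
    os.kill(os.getpid(), signal.SIGKILL)  # simulate a kill by the OS
    done_event.set()  # never reached


def demo_detect_killed_worker():
    """Return the dead worker's exit code, or None if it finished."""
    # Assumption: "fork" is available (POSIX).
    context = multiprocessing.get_context("fork")
    done_event = context.Event()
    process = context.Process(target=_killed_task, args=(done_event,))
    process.start()
    try:
        wait_for_worker(done_event, process)
    except WorkerDiedError as error:
        return error.exit_code
    finally:
        process.join()
    return None


if __name__ == "__main__":
    print(demo_detect_killed_worker())
```

A negative exit code means the process was terminated by that signal number, so the caller can report the cause and shut the graph down instead of hanging.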