Monitoring —
The connection is still not perfect, but jobs are no longer being killed.
Posted by Cristian
Investigating —
As a temporary workaround, we've increased the timeout value the scheduler uses to decide whether a node is unreachable. This should prevent jobs running on the V100 nodes from being killed.
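For reference, a sketch of the kind of change involved, assuming the timeout in question is Slurm's `SlurmdTimeout` (the interval after which the controller marks an unresponsive node DOWN and kills its jobs); the parameter and values below are assumptions, not the exact change we made:

```
# slurm.conf (excerpt) — hypothetical values
# SlurmdTimeout: seconds slurmctld waits for a node's slurmd to respond
# before setting the node DOWN, which terminates the jobs running on it.
SlurmdTimeout=600   # raised from the 300-second default to ride out brief network drops
```

A change like this can be applied without restarting the daemons by running `scontrol reconfigure` after editing slurm.conf.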
Posted by Bob
Investigating —
Because of intermittent network issues between the Slurm scheduler and the V100 GPU nodes, the scheduler may have killed jobs with a "node failure" error. We're looking into the issue.