Affected services:

  • Compute nodes

Intermittent network issues with V100 GPU nodes

Opened on Friday 28th July 2023, last updated

Resolved — The V100 nodes are fully stable.

Posted by Fokke

Monitoring — The connection is still not perfect, but jobs are no longer being killed.

Posted by Cristian

Investigating — As a temporary workaround, we've increased the timeout that the scheduler uses to decide whether a node is unreachable. This should prevent jobs running on the V100 nodes from being killed.

Posted by Bob
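
The update above does not name the timeout that was changed. Purely as an illustration, Slurm's SlurmdTimeout setting is one parameter that fits the description: it controls how long the controller waits for a node's slurmd to respond before marking the node DOWN. The sketch below assumes that this is the relevant setting and that scontrol is available on the PATH; the helper names are hypothetical.

```python
import re
import subprocess


def parse_slurmd_timeout(config_text):
    """Return SlurmdTimeout in seconds from `scontrol show config` output, or None."""
    # Lines look like "SlurmdTimeout           = 300 sec".
    match = re.search(r"^SlurmdTimeout\s*=\s*(\d+)", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def current_slurmd_timeout():
    """Query the running configuration (assumes `scontrol` is installed)."""
    result = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    )
    return parse_slurmd_timeout(result.stdout)
```

Calling current_slurmd_timeout() on a login node would show the value currently in effect, which may or may not be the setting the operators adjusted.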

Investigating — Intermittent network issues between the Slurm scheduler and the V100 GPU nodes may have caused the scheduler to kill jobs with a "node failure". We're looking into the issue.

Posted by Bob
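
Users who want to check whether their own jobs were affected could look for the NODE_FAIL state in Slurm's accounting records. The following is a minimal sketch, assuming that sacct is available and that job accounting is enabled on the cluster. The function names and the default start date (taken from the date the incident was opened) are illustrative assumptions, and earlier failures would need an earlier date.

```python
import subprocess


def node_fail_jobs(sacct_output):
    """Return (job_id, node_list) pairs for rows whose state is NODE_FAIL.

    Expects `sacct --parsable2 --noheader --format=JobID,State,NodeList` output,
    i.e. one "JobID|State|NodeList" row per line.
    """
    affected = []
    for line in sacct_output.splitlines():
        fields = line.split("|")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        job_id, state, nodes = fields
        if state.startswith("NODE_FAIL"):
            affected.append((job_id, nodes))
    return affected


def query_sacct(start="2023-07-28"):
    """Fetch the current user's job records since `start` (assumed date)."""
    cmd = [
        "sacct", "--starttime", start,
        "--parsable2", "--noheader",
        "--format=JobID,State,NodeList",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Running node_fail_jobs(query_sacct()) on a login node would list any of your jobs that the scheduler recorded as ending because of a node failure.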