Intermittent network issues with V100 GPU nodes

Opened on Friday, July 28, 2023; last updated November 02, 2023 13:51

Resolved — The V100 nodes are fully stable.

Posted November 02, 2023 13:51 by Fokke

Monitoring — The connection is still not perfect, but jobs are no longer being killed.

Posted August 31, 2023 09:57 by Cristian

Investigating — As a temporary workaround, we've increased the timeout that the scheduler uses to decide whether a node is unreachable. This should prevent jobs running on the V100 nodes from being killed.

Posted July 28, 2023 11:18 by Bob

Investigating — Intermittent network issues between the Slurm scheduler and the V100 GPU nodes may have caused the scheduler to kill jobs, reporting a "node failure". We're looking into the issue.

Posted July 28, 2023 11:16 by Bob