Affected services:

  • Compute nodes

Segmentation faults on V100 GPU nodes

Opened on Friday 14th August 2020, last updated

Resolved — All V100 nodes (except for a few with other issues) have been patched, so we're closing this issue.

Posted by Bob

Monitoring — Some jobs have been running on the patched nodes, and the issue has not recurred. The majority of the V100 nodes have been updated and are available again. The others will follow soon.

Posted by Bob

Monitoring — We installed a patched version of the filesystem client on some of the V100 nodes, and this has fixed the issue. These nodes have been put back into the queue. Once the jobs on the other V100 nodes have finished, we will upgrade those as well.

Posted by Bob

Identified — The issue seems to be caused by a bug in the filesystem client on the GPU nodes: https://jira.whamcloud.com/browse/LU-13137. We're going to downgrade to a previous version of the client. In the meantime, none of the V100 nodes are starting new jobs.

Posted by Bob

Investigating — We're getting reports that GPU jobs crash with a segmentation fault, often after approximately an hour. We've been able to reproduce this issue and are currently looking for a solution.

Posted by Bob