Affected services:

  • Compute nodes

Cluster and storage down because of power outage

Opened on Monday 6th September 2021

Resolved — As everything is back online, except for Data Handling Silo 2 (which is not expected to be back online anytime soon), we're closing this issue for now. The status of Silo 2 will stay on "Major outage" until it is available again.

Posted by Bob

Monitoring — All compute nodes are available again. Silo 2 of the data handling storage is still not available.

Posted by Fokke

Monitoring — Most of the cluster is operational again. The GPU nodes are still offline because they depend on an external file system that is still being checked.

Posted by Fokke

Investigating — There was a power outage in the DUO data center. We are currently bringing the cluster back online. Jobs that were running on the rebooted nodes have been lost.

Posted by Fokke

Investigating — The cluster nodes suddenly rebooted. We are investigating the issue.

Posted by Fokke