Affected services:

  • Login nodes
  • Interactive nodes
  • Interactive GPU nodes
  • Web portal
  • Storage
  • Scheduler
  • Compute nodes

Power maintenance affecting Peregrine cluster

Scheduled for Friday 16th October 2020 at 16:00 (Amsterdam)

Schedule/description of work

The data center at DUO, where Peregrine is hosted, will perform maintenance on its entire power infrastructure on the 17th and 18th of October. As a result, the cluster will be without power on both days. The cluster will be brought down the day before, on the 16th at 16:00, and, if all goes well, will be back up on the 19th from 09:00.

Scheduled start time
October 16, 2020 16:00
Duration
2 days 17 hours
Status
Finished

Updates

The portal (portal.hpc.rug.nl) has been restarted as well. Everything, except for a few compute nodes, should be back online. If you encounter any issues, please let us know.

Posted by Bob

The issues with the module environment have been solved. We removed the reservation in the scheduler, and jobs have started running again.

Posted by Bob

Because of these issues with the module environment, the scheduler will not start any new jobs (as most of them will crash right away) until the issues have been resolved.

Posted by Bob

The majority of the nodes are back online, including the login nodes. The compute nodes just started running jobs again, but we found that the software modules are not available. Unfortunately, jobs depending on modules may have failed because of this. We are still investigating and hope to fix it as soon as possible.

Posted by Bob