Affected services:

  • Storage

Storage issues affecting jobs and login attempts

Opened on Thursday 26th August 2021

Resolved — The problem hasn't returned, so we're closing the issue.

Posted by Bob

Monitoring — We may have found the cause of the issues, and we will keep monitoring the stability of the storage.

Posted by Bob

Identified — The file systems are back online, and we're now investigating why the issue keeps popping up.

Posted by Bob

Investigating — Unfortunately, the issue has returned, and it's currently not possible to log in. We are looking into it.

Posted by Bob

Resolved — It looks like the storage is stable now, so we’re closing this issue.

Posted by Bob

Monitoring — The metadata server was in an unhealthy state. We've performed a health check to fix this, restarted the services, and remounted the storage. All file systems are back online, and you should be able to log in again and access your files. We are still monitoring the situation. If you encounter any issues, please let us know. Also, if you had any open files (e.g. running jobs that were writing to files), make sure to double-check these files.

Posted by Bob

Investigating — We are again seeing the same kind of storage issue as yesterday. Access to the storage is extremely slow, making it hard to log in and access files. We are investigating.

Posted by Bob

Monitoring — We've resolved the issue with the storage. You should be able to log in again, and jobs should continue running. We will leave this issue open while we continue to monitor the storage.

Posted by Bob

Investigating — Due to storage issues, it may currently be difficult or impossible to log in to Peregrine, and jobs may have trouble writing to output files. Only some nodes (including the login node) seem to be affected. We're looking into it and will post an update here as soon as we know more.

Posted by Bob