Affected services:

  • Storage
  • Scheduler
  • Compute nodes

Issues on /scratch

Opened on Wednesday, 6th December 2023

Resolved — We have identified and isolated the cause of this problem and haven't experienced any issues since. This incident is now resolved.

Posted by Pedro

Monitoring — The file system hasn't crashed during the holidays, but we're still monitoring it.

Posted by Bob

Investigating — The file system is available again, jobs have been resumed, and all partitions have been reactivated.

Posted by Bob

Investigating — The file system collapsed again, and we're currently trying to bring it back. Meanwhile, the partitions have been disabled and all running jobs have been suspended again.

Posted by Bob

Investigating — The /scratch file system is available again, and jobs have been resumed.

Posted by Bob

Investigating — Unfortunately, the issue has just resurfaced.

Posted by Bob

Investigating — The Lustre software has been upgraded, the storage has been remounted, and the nodes have been made available again. So far, we haven't seen any errors or issues on the storage servers, but we will keep monitoring them closely.

Posted by Bob

Investigating — We will upgrade the Lustre /scratch file system software from version 2.15.2 to version 2.15.3 to make sure we are not affected by the bugs that have been patched in the latest release.
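
For reference, below is a minimal sketch (not part of the official procedure) of how the Lustre version reported by each node could be verified after such an upgrade. The "lctl get_param version" query is standard Lustre tooling; the host names and SSH access are assumptions for illustration only.

import subprocess

NODES = ["node001", "node002", "oss01", "oss02"]   # hypothetical hostnames
EXPECTED = "2.15.3"

def lustre_version(node: str) -> str:
    # "lctl get_param -n version" prints the version string, e.g. "2.15.3"
    result = subprocess.run(
        ["ssh", node, "lctl", "get_param", "-n", "version"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    for node in NODES:
        try:
            version = lustre_version(node)
        except Exception as exc:       # unreachable node, lctl missing, etc.
            print(f"{node}: ERROR ({exc})")
            continue
        status = "ok" if EXPECTED in version else "MISMATCH"
        print(f"{node}: {version} [{status}]")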

Posted by Fokke

Investigating — As part of ongoing maintenance, /scratch is currently not accessible. We will be carrying out updates to the storage infrastructure software, which are likely to require reboots. We apologise for the inconvenience.

Posted by Pedro

Investigating — We restarted the storage controllers yesterday to make sure that they were not causing issues, and we also updated their firmware. Unfortunately, we encountered the same issue again yesterday evening: the file system processes accessing the storage device hang and no longer respond. For now, a failover to the second storage server fixed the issue, but this only works as long as there is still a healthy storage server to fail over to.
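
As an illustration of the symptom described here, the sketch below probes a mount point from a separate process and flags it as hung if a simple filesystem call does not return within a timeout. The mount path and timeout are assumptions, and this is not the monitoring actually used on the cluster.

import multiprocessing
import os

MOUNT = "/scratch"    # the affected file system
TIMEOUT_S = 15        # arbitrary; how long we are willing to wait

def _probe(path: str) -> None:
    # statvfs blocks if the mount is unresponsive, which is exactly the
    # condition we want to detect from a separate process.
    os.statvfs(path)

def mount_responsive(path: str, timeout: float) -> bool:
    proc = multiprocessing.Process(target=_probe, args=(path,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # A probe stuck in an uninterruptible filesystem call may not even
        # die on SIGTERM; either way, the mount is not healthy.
        proc.terminate()
        proc.join(1)
        return False
    return proc.exitcode == 0

if __name__ == "__main__":
    if mount_responsive(MOUNT, TIMEOUT_S):
        print(f"{MOUNT} responded within {TIMEOUT_S}s")
    else:
        print(f"{MOUNT} appears hung or unavailable")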

Posted by Fokke

Investigating — We ran short jobs over the weekend successfully, but when ramping up to production the /scratch file system failed again. We are now going to reboot the storage controllers (although they don't report any issues) to make sure they are working properly.

Posted by Fokke

Investigating — The nodes have been rebooted and the storage has been reconfigured and remounted. We're once again going to gradually make some compute nodes available, and we will keep monitoring the load on the storage servers.

Posted by Bob

Investigating — In order to reconfigure the storage, all nodes have to be rebooted again. Unfortunately, this means that any running jobs will be lost.

Posted by Bob

Investigating — The storage unfortunately failed again after a few hours, and we need to check and possibly adjust some configuration settings.

Posted by Fokke

Monitoring — After another round of maintenance and recovery, in addition to putting some potentially high-load jobs on hold, we are now gradually bringing nodes back up and allowing new jobs to be started.

Posted by Pedro

Investigating — Unfortunately, this issue has resurfaced today and /scratch is no longer accessible. To perform additional maintenance, new jobs have been put on hold and will stay pending in the queue until the problem is resolved. Jobs that had already started running have not been paused and are still running.

Posted by Pedro

Monitoring — All nodes have been rebooted, and the /scratch file system has been remounted. We're slowly making some compute nodes available again, and we will keep monitoring the stability of the file system.

Posted by Bob

Investigating — Due to the storage issues we're seeing a lot of zombie processes on the compute nodes. In order to get rid of these, we are going to reboot all compute nodes. Unfortunately, this means that the few jobs that were still running will be cancelled as well.
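
For context, a zombie-process check of the kind that motivated this reboot could look like the following sketch, which scans /proc on a Linux node. It is illustrative only and assumes nothing about the cluster beyond a standard Linux /proc filesystem.

import os

def zombie_pids() -> list[int]:
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as fh:
                stat = fh.read()
        except OSError:
            continue      # the process exited while we were scanning
        # /proc/<pid>/stat looks like "pid (comm) state ..."; the state is
        # the first token after the closing parenthesis, and "Z" marks a
        # zombie process.
        state = stat.rsplit(")", 1)[1].split()[0]
        if state == "Z":
            zombies.append(int(entry))
    return zombies

if __name__ == "__main__":
    pids = zombie_pids()
    print(f"{len(pids)} zombie process(es) found: {pids}")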

Posted by Bob

Investigating — The storage system issues also affect the compute nodes; we are currently working on fixing the problems.

Posted by Cristian

Investigating — Shortly after the previous update, we started observing issues accessing /scratch again. It is likely some issues in the file system remain, and /scratch access is not completely functional yet. For the time being, jobs are still running and the scheduler is starting new jobs, but the situation may change if we need to start another round of maintenance. We apologize for the inconvenience and will be updating this page as necessary.

Posted by Pedro

Monitoring — The file system checks on the storage servers have finished and access is restored. Suspended jobs have been resumed and submitted jobs can now also run. We will continue to monitor the situation in case the issue reoccurs.

Posted by Pedro

Investigating — A new round of maintenance is required to identify and solve the issues with /scratch. The load on the storage back-end seems to be too high, which results in it becoming inaccessible to users. In order to pinpoint the cause, all running jobs have been suspended and no new jobs can start while maintenance is ongoing, as was the case yesterday. We apologize for the inconvenience.

Posted by Pedro

Investigating — We are once again seeing high load on the storage servers, which could lead to issues with /scratch.
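
A load check of the sort referred to here could be as simple as the sketch below, which polls /proc/loadavg on the storage servers over SSH. The host names and threshold are placeholders, not the cluster's actual configuration.

import subprocess

STORAGE_SERVERS = ["oss01", "oss02"]   # hypothetical hostnames
LOAD_THRESHOLD = 32.0                  # arbitrary; tune to the core count

def load_average(host: str) -> float:
    # The first field of /proc/loadavg is the 1-minute load average.
    result = subprocess.run(
        ["ssh", host, "cat", "/proc/loadavg"],
        capture_output=True, text=True, timeout=15, check=True,
    )
    return float(result.stdout.split()[0])

if __name__ == "__main__":
    for host in STORAGE_SERVERS:
        load = load_average(host)
        flag = "HIGH" if load > LOAD_THRESHOLD else "ok"
        print(f"{host}: 1-min load {load:.1f} [{flag}]")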

Posted by Pedro

Monitoring — The issues have been resolved. Jobs have been resumed and new jobs can now start. We will continue to monitor the situation.

Posted by Pedro

Identified — Maintenance must be performed on the storage servers to resolve the issues with access to /scratch that started occurring yesterday (06/12) afternoon. During maintenance, all jobs have been suspended and the scheduler will not be accepting new jobs in any partition. Jobs that were already running should resume once maintenance ends. We apologize for the inconvenience and will be updating this page with more details as necessary.
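
For readers wondering what "suspending jobs and not accepting new ones" involves, the sketch below shows one way it could be done assuming a Slurm scheduler (the scheduler is not named in these updates, so this is an assumption). It uses the standard squeue/scontrol commands with hypothetical partition names and would need scheduler admin rights.

import subprocess

PARTITIONS = ["compute", "gpu"]   # hypothetical partition names

def running_job_ids() -> list[str]:
    # List the IDs of all currently running jobs.
    result = subprocess.run(
        ["squeue", "--noheader", "--states=RUNNING", "--format=%i"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()

def suspend_all_jobs() -> None:
    for job_id in running_job_ids():
        subprocess.run(["scontrol", "suspend", job_id], check=True)

def close_partition(name: str) -> None:
    # State=DOWN prevents the scheduler from starting new jobs there.
    subprocess.run(
        ["scontrol", "update", f"PartitionName={name}", "State=DOWN"],
        check=True,
    )

if __name__ == "__main__":
    suspend_all_jobs()
    for partition in PARTITIONS:
        close_partition(partition)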

Posted by Pedro

Identified — The issue is still present for some users, who are thus unable to access /scratch. We are working on restoring storage.

Posted by Pedro

Monitoring — The system has been rebooted and the issue has been resolved. We will continue to monitor the situation.

Posted by Pedro

Investigating — There is an issue with the Lustre storage servers that results in degraded performance on /scratch. Some users are not able to access /scratch. We are working on the issue and will be updating this page.

Posted by Pedro