Affected services:

  • Storage
  • Scheduler
  • Compute nodes

Issues on /scratch

Opened on Wednesday, 6th December 2023

Resolved — We have identified and isolated the cause of this problem and haven't experienced any issues since. This incident is now resolved.

Posted by Pedro

Monitoring — The file system hasn't crashed during the holidays, but we're still monitoring it.

Posted by Bob

Investigating — The file system is available again, jobs have been resumed, and all partitions have been reactivated.

Posted by Bob

Investigating — The file system collapsed again, and we're currently trying to bring it back. Meanwhile, the partitions have been disabled and all running jobs have been suspended again.

Posted by Bob

Investigating — The /scratch file system is available again, and jobs have been resumed.

Posted by Bob

Investigating — Unfortunately, the issue has just resurfaced.

Posted by Bob

Investigating — The Lustre software has been upgraded, the storage has been remounted, and the nodes have been made available again. So far, we haven't seen any errors or issues on the storage servers, but we will keep monitoring them closely.

Posted by Bob

Investigating — We will upgrade the Lustre /scratch file system software from version 2.15.2 to version 2.15.3 to make sure we are not affected by the bugs that have been patched in the latest release.
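
For reference, below is a minimal sketch (not part of the official procedure) of how the Lustre version reported by each node could be verified after such an upgrade. The "lctl get_param version" query is standard Lustre tooling; the host names and SSH access are assumptions for illustration only.

import subprocess

NODES = ["node001", "node002", "oss01", "oss02"]   # hypothetical hostnames
EXPECTED = "2.15.3"

def lustre_version(node: str) -> str:
    # "lctl get_param -n version" prints the version string, e.g. "2.15.3"
    result = subprocess.run(
        ["ssh", node, "lctl", "get_param", "-n", "version"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    for node in NODES:
        try:
            version = lustre_version(node)
        except Exception as exc:       # unreachable node, lctl missing, etc.
            print(f"{node}: ERROR ({exc})")
            continue
        status = "ok" if EXPECTED in version else "MISMATCH"
        print(f"{node}: {version} [{status}]")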

Posted by Fokke

Investigating — As part of ongoing maintenance, /scratch is currently not accessible. We will be carrying out updates to the storage infrastructure software, which are likely to require reboots. We apologise for the inconvenience.

Posted by Pedro

Investigating — We restarted the storage controllers yesterday to make sure that they were not causing issues, and we also updated their firmware. Unfortunately, we encountered the same issue again yesterday evening: the file system processes accessing the storage device hang and no longer respond. For now, a failover to the second storage server fixed the issue, but this only works as long as there is still a healthy storage server to fail over to.
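
As an illustration of the symptom described here, the sketch below probes a mount point from a separate process and flags it as hung if a simple filesystem call does not return within a timeout. The mount path and timeout are assumptions, and this is not the monitoring actually used on the cluster.

import multiprocessing
import os

MOUNT = "/scratch"    # the affected file system
TIMEOUT_S = 15        # arbitrary; how long we are willing to wait

def _probe(path: str) -> None:
    # statvfs blocks if the mount is unresponsive, which is exactly the
    # condition we want to detect from a separate process.
    os.statvfs(path)

def mount_responsive(path: str, timeout: float) -> bool:
    proc = multiprocessing.Process(target=_probe, args=(path,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # A probe stuck in an uninterruptible filesystem call may not even
        # die on SIGTERM; either way, the mount is not healthy.
        proc.terminate()
        proc.join(1)
        return False
    return proc.exitcode == 0

if __name__ == "__main__":
    if mount_responsive(MOUNT, TIMEOUT_S):
        print(f"{MOUNT} responded within {TIMEOUT_S}s")
    else:
        print(f"{MOUNT} appears hung or unavailable")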

Posted by Fokke

Investigating — We ran short jobs over the weekend successfully, but when ramping up to production the /scratch file system failed again. We are now going to reboot the storage controllers (although they don't report any issues) to make sure they are working properly.

Posted by Fokke

Investigating — The nodes have been rebooted and the storage has been reconfigured and remounted. We're once again going to gradually make some compute nodes available, and we will keep monitoring the load on the storage servers.

Posted by Bob

Investigating — In order to reconfigure the storage, all nodes have to be rebooted again. Unfortunately, this means that any running jobs will be lost.

Posted by Bob

Investigating — The storage unfortunately failed again after a few hours, and we need to check and possibly adjust some configuration settings.

Posted by Fokke

Monitoring — After another round of maintenance and recovery, in addition to putting some potentially high-load jobs on hold, we are now gradually bringing nodes back up and allowing new jobs to be started.

Posted by Pedro

Investigating — Unfortunately, this issue has resurfaced today and /scratch is no longer accessible. To perform additional maintenance, new jobs have been put on hold and will stay pending in the queue until the problem is resolved. Jobs that had already started running have not been paused and are still running.

Posted by Pedro

Monitoring — All nodes have been rebooted, and the /scratch file system has been remounted. We're slowly making some compute nodes available again, and we will keep monitoring the stability of the file system.

Posted by Bob

Investigating — Due to the storage issues we're seeing a lot of zombie processes on the compute nodes. In order to get rid of these, we are going to reboot all compute nodes. Unfortunately, this means that the few jobs that were still running will be cancelled as well.
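
For context, a zombie-process check of the kind that motivated this reboot could look like the following sketch, which scans /proc on a Linux node. It is illustrative only and assumes nothing about the cluster beyond a standard Linux /proc filesystem.

import os

def zombie_pids() -> list[int]:
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as fh:
                stat = fh.read()
        except OSError:
            continue      # the process exited while we were scanning
        # /proc/<pid>/stat looks like "pid (comm) state ..."; the state is
        # the first token after the closing parenthesis, and "Z" marks a
        # zombie process.
        state = stat.rsplit(")", 1)[1].split()[0]
        if state == "Z":
            zombies.append(int(entry))
    return zombies

if __name__ == "__main__":
    pids = zombie_pids()
    print(f"{len(pids)} zombie process(es) found: {pids}")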

Posted by Bob

Investigating — The storage system issues also affect the compute nodes; we are currently working on fixing the problems.

Posted by Cristian

Investigating — Shortly after the previous update, we started observing issues accessing /scratch again. It is likely some issues in the file system remain, and /scratch access is not completely functional yet. For the time being, jobs are still running and the scheduler is starting new jobs, but the situation may change if we need to start another round of maintenance. We apologize for the inconvenience and will be updating this page as necessary.

Posted by Pedro

Monitoring — The file system checks on the storage servers have finished and access is restored. Suspended jobs have been resumed and submitted jobs can now also run. We will continue to monitor the situation in case the issue reoccurs.

Posted by Pedro

Investigating — A new round of maintenance is required to identify and solve the issues with /scratch. The load on the storage back-end seems to be too high, which results in it becoming inaccessible to users. In order to pinpoint the cause, all running jobs have been suspended and no new jobs can start while maintenance is ongoing, as was the case yesterday. We apologize for the inconvenience.

Posted by Pedro

Investigating — We are once again seeing high load on the storage servers, which could lead to issues with /scratch.
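
A load check of the sort referred to here could be as simple as the sketch below, which polls /proc/loadavg on the storage servers over SSH. The host names and threshold are placeholders, not the cluster's actual configuration.

import subprocess

STORAGE_SERVERS = ["oss01", "oss02"]   # hypothetical hostnames
LOAD_THRESHOLD = 32.0                  # arbitrary; tune to the core count

def load_average(host: str) -> float:
    # The first field of /proc/loadavg is the 1-minute load average.
    result = subprocess.run(
        ["ssh", host, "cat", "/proc/loadavg"],
        capture_output=True, text=True, timeout=15, check=True,
    )
    return float(result.stdout.split()[0])

if __name__ == "__main__":
    for host in STORAGE_SERVERS:
        load = load_average(host)
        flag = "HIGH" if load > LOAD_THRESHOLD else "ok"
        print(f"{host}: 1-min load {load:.1f} [{flag}]")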

Posted by Pedro

Monitoring — The issues have been resolved. Jobs have been resumed and new jobs can now start. We will continue to monitor the situation.

Posted by Pedro

Identified — Maintenance must be performed on the storage servers to resolve the issues with access to /scratch that started occurring yesterday (06/12) afternoon. During maintenance, all jobs have been suspended and the scheduler will not be accepting new jobs in any partition. Jobs that were already running should resume once maintenance ends. We apologize for the inconvenience and will be updating this page with more details as necessary.
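
For readers wondering what "suspending jobs and not accepting new ones" involves, the sketch below shows one way it could be done assuming a Slurm scheduler (the scheduler is not named in these updates, so this is an assumption). It uses the standard squeue/scontrol commands with hypothetical partition names and would need scheduler admin rights.

import subprocess

PARTITIONS = ["compute", "gpu"]   # hypothetical partition names

def running_job_ids() -> list[str]:
    # List the IDs of all currently running jobs.
    result = subprocess.run(
        ["squeue", "--noheader", "--states=RUNNING", "--format=%i"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()

def suspend_all_jobs() -> None:
    for job_id in running_job_ids():
        subprocess.run(["scontrol", "suspend", job_id], check=True)

def close_partition(name: str) -> None:
    # State=DOWN prevents the scheduler from starting new jobs there.
    subprocess.run(
        ["scontrol", "update", f"PartitionName={name}", "State=DOWN"],
        check=True,
    )

if __name__ == "__main__":
    suspend_all_jobs()
    for partition in PARTITIONS:
        close_partition(partition)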

Posted by Pedro

Identified — The issue is still present for some users, who are thus unable to access /scratch. We are working on restoring storage.

Posted by Pedro

Monitoring — The system has been rebooted and the issue has been resolved. We will continue to monitor the situation.

Posted by Pedro

Investigating — There is an issue with the Lustre storage servers that results in degraded performance on /scratch. Some users are not able to access /scratch. We are working on the issue and will be updating this page.

Posted by Pedro