Hábrók cluster down - Hábrók available again

Opened on Saturday 16th March 2024, last updated May 02, 2024 10:50

Resolved — We have seen no new issues since the last update, and Hábrók is operational.

Posted May 02, 2024 10:50 by Pedro

Monitoring — Almost all A100 GPU nodes are now available, except for one, which is still unavailable at the moment. With this, the system is almost entirely operational again. We will keep monitoring the situation for anything out of the ordinary. As before, if you encounter any issues, please report them to us at hpc@rug.nl. Keep in mind that the first time you connect to Hábrók after the recent downtime, you will see a warning regarding a change in the server host keys. This is expected and you can find more information at: https://wiki.hpc.rug.nl/habrok/additional_information/known_issues#new_server_hostkey_21032024

Posted March 22, 2024 14:39 by Pedro

Monitoring — The login and interactive nodes have been brought back online and it is now possible to connect. /scratch is also available, and so are many of the compute nodes, which means that jobs can be queued and run. The A100 GPU nodes are still unavailable at the moment. We are continuing to monitor the situation and are working on restoring the full capabilities of the system, and we will keep providing updates on this. If you encounter any issues, please report them to us at hpc@rug.nl. When connecting to Hábrók again, you will see a warning regarding a change in the server host keys. This is expected and you can find information on how to proceed on our wiki at: https://wiki.hpc.rug.nl/habrok/additional_information/known_issues#new_server_hostkey_21032024

Posted March 21, 2024 14:49 by Pedro

Investigating — On Tuesday, significant progress was again made in Hábrók's recovery. Among other things, the scratch system is up again. Unfortunately, this does not yet mean that Hábrók is fully operational. We expect to need a few more days for this.

Posted March 20, 2024 10:00 by Pedro

Investigating — Since Saturday, work has been underway to restore the various parts of Hábrók. Considerable steps have now been taken, together with Dell and StackHPC, and the first signs of recovery are positive. However, the systems are not yet operational. Besides Hábrók's recovery, we are also in close contact with the Kapteyn Institute and UMCG about checking and restoring their systems in the CBC. Meanwhile, the investigation into the cause of the gas extinguisher going off also continues. This involves close cooperation with Facility Services colleagues. However, it appears to have been a valid fire alarm. In addition, the refilling of the gas extinguishing system and the administrative handling of the situation are also being addressed.

Posted March 18, 2024 19:02 by Bob

Investigating — On Saturday evening, together with our supplier Dell, we started the recovery work on the Hábrók system. This work will continue for the next few days. We understand that there are many questions about the situation, about whether data is still available and when Hábrók will be available for compute again. Unfortunately, we cannot answer these questions at this time. As previously reported, this should become clear in the coming days. We will continue to publish updates via the Hábrók status page, the mailing list and Iris. In any case, we will post another update on Monday afternoon. Besides working hard to restore Hábrók, we are also conducting extensive investigations with suppliers into the cause of the fire extinguishing system going off.

Posted March 17, 2024 14:22 by Bob

Investigating — Update on the impact of the fire alarm at the Coenraad Bron Datacenter (CBC). A more detailed inspection of the systems in the CBC took place today (Saturday 16 March). The fire alarm triggered the gas extinguishing system. This probably caused some of the hard drives to fail, resulting in the Hábrók cluster being completely down at the moment. The next step for us is to analyse the affected disks and replace them where necessary. We are in close contact with the supplier about accelerated delivery of new disks and support. At this moment, we cannot estimate the lead time for this, so it is still unclear how long Hábrók will remain unusable. This should become clear over the next few days. Unfortunately, we also cannot say anything at this point about possible data loss; further analysis will have to show this. We will post the next update on Sunday afternoon.

Posted March 16, 2024 19:41 by Bob

Investigating — On Friday evening, 15 March, there was a fire alarm at the Coenraad Bron Centre Datacenter. As a result, the automatic extinguishing system was activated. The fire brigade attended the scene and found no visible fire damage. We are currently investigating the cause of the fire alarm and the exact consequences. As a result of the automatic extinguishing action, several systems went down, including (part of) the HPC cluster Hábrók. At the moment, we are unable to indicate the exact impact or when systems will be accessible again. New announcements will follow on Saturday evening.

Posted March 16, 2024 12:36 by Bob

Investigating — We are seeing problems with the scratch filesystem, which seem to have taken down the cluster. We are investigating.

Posted March 16, 2024 07:39 by Henk-Jan