Affected services:

  • Scheduler

Issues with the scheduler

Opened on Tuesday 16th January 2024

Resolved — We have updated Slurm to a new version that should address these problems, and we haven't heard reports of issues since. We are therefore marking this issue as resolved.

Posted by Pedro

Monitoring — Most of the nodes have now been downgraded to an earlier release of the scheduler software. Several users reported failures for jobs using Intel MPI; this has been resolved by also downgrading the Slurm version on the login nodes. We have also tested the latest release, 23.11.2, but it did not solve the problems we have seen since 23.11.1. Unfortunately, this upgrade triggered job failures on compute nodes that were in a draining state and still running 23.11.1. We apologize for the inconvenience. For now we will keep running 23.11.2 on the central services and 23.02.7 on the compute and login nodes. When new updates to 23.11 arrive, they will first be tested on a few separate nodes.
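Since different Slurm versions are now deliberately running on the central services and on the compute and login nodes, it can be useful to check which version a given node reports. A minimal sketch using standard Slurm client options (the resource requests are illustrative):

```shell
# Version of the Slurm client tools on the login node
scontrol --version

# Version seen from within a job on a compute node
srun --ntasks=1 --time=00:01:00 scontrol --version
```

These commands require access to a Slurm cluster and are shown only as a way to verify which release a particular node is running.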

Posted by Fokke

Investigating — We have found that downgrading the scheduler service on the compute nodes appears to be a workaround for the issue. We are currently in the process of doing this. Unfortunately downgrading the central scheduler components would require full downtime for the control daemon and downgrading the job database is not supported at all. We will therefore try to stick with the workaround until the issue has been fixed in the software.

Posted by Fokke

Investigating — Since upgrading to a more recent version of the Slurm scheduler, we have started seeing issues in the communication between the scheduler components. This results in some jobs being stuck in the CG (completing) state, even after they have been cancelled by users or have otherwise terminated normally. In some situations, jobs with this behavior don't produce a log file or results.

Additionally, some users have reported errors loading modules at the beginning of their jobs. This likely happens because the scheduler is unable to initialize the job environment properly, resulting in error messages reporting that the module command is not available; the jobs then fail immediately because of this error. In some cases, adding the line "module purge" to the jobscript before loading modules has helped minimize this problem, but not in every situation.

We apologize for the inconvenience and are working on understanding and solving the problem. To do so, nodes often need to be restarted. This takes time because nodes can only be restarted after they are drained (i.e., all ongoing jobs have finished normally) to prevent the cancellation of running jobs. Because of these issues, we are seeing a higher job failure rate than normal, even when there is nothing necessarily incorrect in the jobscripts or the code called within them. If you are experiencing these issues, resubmitting the failed jobs could help, as the problems may not manifest in the re-submissions.
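As a sketch, the "module purge" workaround mentioned above can be placed in a jobscript before any module loads. The job name, resource requests, module name, and executable below are illustrative, not taken from this incident:

```shell
#!/bin/bash
#SBATCH --job-name=example      # illustrative job name
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Clear any modules inherited from a partially initialized
# job environment before loading the ones this job needs.
module purge

# Illustrative module load; substitute the modules your job uses.
module load 2023

srun ./my_program               # hypothetical executable
```

This is a batch-script fragment for a Slurm cluster; as noted above, the workaround reduces but does not eliminate the module-initialization failures.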

Posted by Pedro