I'm working with Slurm and facing issues specifically with the cgroups plugin on Ubuntu 22.04 nodes. Our team is relatively new to Slurm, and we've been trying to optimize our resource management for complex computing tasks. However, we've encountered a series of errors that are proving difficult to resolve.

Here's a brief overview of our problem:

  • We initially used the cgroup/v2 plugin on two Ubuntu 22.04 nodes and one Ubuntu 18.04 node, which didn't work as expected.

  • After switching to the cgroup/v1 plugin, we could run jobs on the Ubuntu 18.04 node, but the Ubuntu 22.04 nodes started showing errors.

  • The errors involve failures to open and mount directories under /sys/fs/cgroup. The affected nodes show as idle, and after a job is attempted on them they go into the drain state.

The logs below show where the errors start to appear:

[2023-10-12T14:50:29.479] [36.batch] error: unable to open '/sys/fs/cgroup/cpuset//tasks' for reading : No such file or directory
[2023-10-12T14:50:29.511] [36.batch] error: unable to mount cpuset cgroup namespace: Device or resource busy
[2023-10-12T14:50:29.511] [36.batch] error: unable to create cpuset cgroup namespace
[2023-10-12T14:50:29.511] [36.batch] error: unable to open '/sys/fs/cgroup/devices//tasks' for reading : No such file or directory
[2023-10-12T14:50:29.512] [36.batch] cgroup/v1: xcgroup_ns_create: cgroup namespace 'devices' is now mounted
[2023-10-12T14:50:29.514] [36.batch] error: common_cgroup_lock error
[2023-10-12T14:50:29.514] [36.batch] error: task_g_pre_setuid: task/cgroup: Unspecified error
[2023-10-12T14:50:29.514] [36.batch] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2023-10-12T14:50:29.515] [36.batch] error: called without a previous init. This shouldn't happen!
[2023-10-12T14:50:29.515] [36.batch] error: job_manager: exiting abnormally: Slurmd could not execve job
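
Since the first error is a missing '/sys/fs/cgroup/cpuset//tasks', one check that might help is to see which cgroup hierarchy each node actually has mounted. Below is a minimal sketch, assuming a standard Linux /proc/mounts layout; the function name classify_cgroup_layout is hypothetical and not part of Slurm. It only reports whether cgroup v1 mounts, a cgroup v2 mount, or both are present.

```python
def classify_cgroup_layout(mounts_text: str) -> str:
    """Classify /proc/mounts-style text as 'v1', 'v2', 'hybrid', or 'none'.

    Hypothetical helper for illustration; not part of Slurm.
    """
    has_v1 = False
    has_v2 = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 3:
            continue
        fstype = fields[2]
        if fstype == "cgroup":
            has_v1 = True   # per-controller v1 mount, e.g. /sys/fs/cgroup/cpuset
        elif fstype == "cgroup2":
            has_v2 = True   # unified v2 mount
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "none"


def classify_node(path: str = "/proc/mounts") -> str:
    """Read the mount table of the local node and classify it."""
    with open(path) as f:
        return classify_cgroup_layout(f.read())
```

Running classify_node() on the 18.04 node and on a 22.04 node should show whether the two really expose the same hierarchy. If a 22.04 node reports 'v2', there is no per-controller cpuset directory for a v1 plugin to read, which would be consistent with the first error above.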

We've tried altering kernel parameters, but this hasn't resolved the issue. I'm looking for advice on troubleshooting these errors on Ubuntu 22.04.
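
To confirm that a kernel parameter change actually took effect after a reboot, the running command line can be inspected. The sketch below is an assumption about what is worth looking for, not a record of what we changed: it picks out three cgroup-related boot parameters (systemd.unified_cgroup_hierarchy, systemd.legacy_systemd_cgroup_controller, cgroup_no_v1) from /proc/cmdline-style text. The function name cgroup_cmdline_flags is hypothetical.

```python
# Boot parameters that influence which cgroup hierarchy is used.
# Chosen for illustration; the list is an assumption, not exhaustive.
CGROUP_PARAMS = (
    "systemd.unified_cgroup_hierarchy",
    "systemd.legacy_systemd_cgroup_controller",
    "cgroup_no_v1",
)


def cgroup_cmdline_flags(cmdline: str) -> dict:
    """Return the cgroup-related parameters found in a kernel command line.

    Parameters given without '=value' are mapped to an empty string.
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in CGROUP_PARAMS:
            found[key] = value if sep else ""
    return found


def cgroup_flags_on_node(path: str = "/proc/cmdline") -> dict:
    """Read the command line of the running kernel and extract the flags."""
    with open(path) as f:
        return cgroup_cmdline_flags(f.read())
```

An empty result from cgroup_flags_on_node() would mean none of these parameters reached the running kernel, for example because the bootloader configuration was edited but not regenerated.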

Any insights into potential causes or diagnostic tools that could be useful in this scenario?
