
I use SLURM as a job scheduler and queue for a small cluster (single node with 64 cores). To submit a batch job I use:

> sbatch run.sh

Where run.sh looks like:

#! /bin/bash

#SBATCH --ntasks=4

export OMP_THREAD_LIMIT=4
/home/Benchmarks/Graph500/omp-csr/omp-csr -s 23

However, when I submit 2 batch jobs one after the other I get:

> squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        29     debug   run.sh anonymou PD       0:00      1 (Resources)
        28     debug   run.sh anonymou  R       1:13      1 localhost

Each job needs only 4 cores, so both jobs should run. Maybe I misconfigured the SLURM controller; the relevant lines from /etc/slurm.conf are:

# COMPUTE NODES
NodeName=localhost CPUs=64 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP

I would be thankful for any help or hints.

1 Answer


By default, SLURM does not allow resource sharing: while a job is running on a node, other jobs wait for it to complete before they are scheduled on that node.

SLURM needs to be configured for resource sharing; this is fairly simple and well documented.

An example of what to add to your slurm.conf file (normally located under /etc/slurm) would be:

SelectType=select/cons_res
SelectTypeParameters=
DefMemPerCPU=

This allows the resources of a node to be shared among jobs, using the cons_res (consumable resources) plugin.

The select/cons_res plugin accepts a variety of parameters (SelectTypeParameters). The most prominent are listed below, and an illustrative configuration follows the list (for a full list of parameters please refer to the slurm.conf man page):

CR_CPU: CPUs are the consumable resource.
CR_CPU_Memory: adds memory as a consumable resource to CR_CPU.
CR_Core: Cores are the consumable resource.
CR_Core_Memory: adds memory as a consumable resource to CR_Core.
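
For example, to make cores and memory the consumable resources, the three lines above could be filled in as follows. The DefMemPerCPU value is only an illustrative assumption; it sets the default memory per allocated CPU in megabytes, so pick a value that matches your node:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=2048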

Once that is done and you have selected the type of resource you want SLURM to treat as consumable, all you need to do is add the option Shared=YES to your default partition and issue the command scontrol reconfigure on the node that acts as the controller.
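
With the partition line from your slurm.conf, that would look roughly like this (note that more recent SLURM releases call this option OverSubscribe instead of Shared):

PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP Shared=YES

Then, on the controller node:

scontrol reconfigure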