I am running Slurm 19.05 on a single machine (Ubuntu 18.04) to schedule GPU tasks. However, I am having trouble setting up GPU enforcement with cgroups.
With ConstrainDevices=yes in my cgroup.conf, TensorFlow cannot access the GPU when the job is launched with srun --gres=gpu:1 run.sh. In contrast, jobs started via salloc (i.e., salloc run.sh) get access to the GPU regardless of their allocation. Running srun --gres=mps:50 run.sh works perfectly.
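For illustration, the three cases can be compared with nvidia-smi as a stand-in for run.sh (these commands are just a minimal sketch of what I am doing, not my actual script):

# with ConstrainDevices=yes this step does not see the GPU
srun --gres=gpu:1 nvidia-smi
# via salloc the GPU is accessible regardless of the allocation
salloc nvidia-smi
# requesting an MPS share works fine
srun --gres=mps:50 nvidia-smi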
Here is my slurm.conf:
SlurmctldHost=gpu-node1
Epilog=/etc/slurm/epilog.d/epilog.sh
GresTypes=gpu,mps
MpiDefault=none
PluginDir=/usr/lib/slurm
ProctrackType=proctrack/cgroup
Prolog=/etc/slurm/prolog.d/prolog.sh
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageTRES=gres/gpu,gres/mps
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
ClusterName=gpu-cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=gpu-node1 NodeAddr=localhost Gres=gpu:1,mps:100 CPUs=8 RealMemory=7900 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=DefaultPartition Nodes=gpu-node1 Default=YES MaxTime=720 State=UP
My cgroup.conf looks like:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=no
My gres.conf looks like:
Name=gpu Type=rtx2070 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia0
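For what it's worth, this is roughly how I inspect the devices cgroup of a step; it is only a sketch and assumes the cgroup v1 hierarchy mounted under /sys/fs/cgroup/devices and Slurm's slurm/uid_*/job_*/step_* layout, so the paths may differ on other setups:

# print the devices.list of the step's devices cgroup and look for
# entries covering the NVIDIA character devices (major 195, e.g. /dev/nvidia0)
srun --gres=gpu:1 bash -c \
  'cat /sys/fs/cgroup/devices$(grep ":devices:" /proc/self/cgroup | cut -d: -f3)/devices.list'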
Thanks for your help!