I am running Slurm 19.05 on a single machine (Ubuntu 18.04) to schedule GPU tasks. However, I am having trouble setting up GPU enforcement with cgroups.
With ConstrainDevices=yes in my cgroup.conf, TensorFlow cannot access the GPU when the job is launched with srun --gres=gpu:1 run.sh. In contrast, jobs started via salloc (i.e., salloc run.sh) get access to the GPU regardless of their allocation. Running srun --gres=mps:50 run.sh works perfectly.
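For illustration, the three cases can be compared with nvidia-smi as a stand-in for run.sh (these commands are just a minimal sketch of what I am doing, not my actual script):

# with ConstrainDevices=yes this step does not see the GPU
srun --gres=gpu:1 nvidia-smi
# via salloc the GPU is accessible regardless of the allocation
salloc nvidia-smi
# requesting an MPS share works fine
srun --gres=mps:50 nvidia-smi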
Here is my slurm.conf:
SlurmctldHost=gpu-node1
Epilog=/etc/slurm/epilog.d/epilog.sh
GresTypes=gpu,mps
MpiDefault=none
PluginDir=/usr/lib/slurm
ProctrackType=proctrack/cgroup
Prolog=/etc/slurm/prolog.d/prolog.sh
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageTRES=gres/gpu,gres/mps
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
ClusterName=gpu-cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=gpu-node1 NodeAddr=localhost Gres=gpu:1,mps:100 CPUs=8 RealMemory=7900 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=DefaultPartition Nodes=gpu-node1 Default=YES MaxTime=720 State=UP
My cgroup.conf looks like:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=no
My gres.conf looks like:
Name=gpu Type=rtx2070 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia0
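For what it's worth, this is roughly how I inspect the devices cgroup of a step; it is only a sketch and assumes the cgroup v1 hierarchy mounted under /sys/fs/cgroup/devices and Slurm's slurm/uid_*/job_*/step_* layout, so the paths may differ on other setups:

# print the devices.list of the step's devices cgroup and look for
# entries covering the NVIDIA character devices (major 195, e.g. /dev/nvidia0)
srun --gres=gpu:1 bash -c \
  'cat /sys/fs/cgroup/devices$(grep ":devices:" /proc/self/cgroup | cut -d: -f3)/devices.list'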
Thanks for your help!