Questions tagged [slurm]
32 questions
19 votes, 3 answers
How can I find out how long my slurm job took to execute?
One idea I have to find out how long my slurm job is taking is to use
squeue --job
How do I find out how long my job took once it has completed?
demongolem
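This is a minimal sketch. It assumes job accounting is enabled so that sacct still knows about the finished job. Running `sacct -j <jobid> -X -n -P --format=Elapsed` prints the job's wall-clock time in Slurm's `[DD-][HH:]MM:SS` style. The helper below converts such a value to seconds. The name `elapsed_to_seconds` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: convert a Slurm Elapsed value ([DD-][HH:]MM:SS) to seconds.
elapsed_to_seconds() {
  local t=$1 days=0 h=0 m=0 s=0 a b c
  # Optional "DD-" day prefix, e.g. "1-02:03:04".
  case $t in
    *-*) days=${t%%-*}; t=${t#*-} ;;
  esac
  # Split the remaining clock part on ":".
  IFS=: read -r a b c <<< "$t"
  if [ -n "$c" ]; then
    h=$a; m=$b; s=$c
  else
    m=$a; s=$b
  fi
  # 10# forces base 10, so fields such as "08" are not read as octal.
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

Usage, with JOBID as a placeholder: `elapsed_to_seconds "$(sacct -j "$JOBID" -X -n -P --format=Elapsed)"`.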
5 votes, 1 answer
remove slurm sacct command double entries: "extern"
Jobs that are currently running show two entries, one of which has an .extern suffix. Completed (or failed) jobs also have a third entry: .batch. Is there a way to remove these entries (or hide them) from the sacct output? What are they?
DilithiumMatrix
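Some background first. The .batch step holds the batch script itself. The .extern step accounts for processes in the allocation that were not launched through a Slurm step, such as adopted ssh sessions. If you only want one row per job, sacct's -X (--allocations) option already limits the output to the allocation.

For output you have already saved, a small filter does the same job. The sketch below assumes pipe-delimited output from `sacct -P` with JobID as the first field. `drop_steps` is a hypothetical name, and the filter also drops numbered srun steps such as 42.0.

```bash
#!/usr/bin/env bash
# Sketch: drop job-step rows (42.batch, 42.extern, 42.0, ...) from `sacct -P` output.
drop_steps() {
  # Keep rows whose first field (JobID) contains no ".", i.e. the header and allocation rows.
  awk -F'|' '$1 !~ /\./'
}
```

Usage: `sacct -P --format=JobID,JobName,State | drop_steps`.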
4 votes, 2 answers
Error code 140 in command running through Nextflow on SLURM
[Note: question heavily edited to correspond to the actual problem]
I'm trying to debug a command that fails only under specific conditions. It fails with exit code 140, but I have no other information.
This command is cat in_file | tr "\t"…
Alexlok
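Treat this as a hedged starting point. Shells usually report a process killed by signal N as exit status 128 + N, so 140 would correspond to signal 12 if that convention applies here. Nextflow and Slurm can also report their own codes, so cross-check with `sacct -j <jobid> --format=JobID,State,ExitCode`.

The sketch below decodes a status under that assumption only. `describe_status` is a hypothetical helper, and signal names differ between platforms.

```bash
#!/usr/bin/env bash
# Sketch: interpret an exit status, assuming the shell convention
# "status > 128 means terminated by signal (status - 128)".
describe_status() {
  local status=$1 sig name
  if [ "$status" -gt 128 ] && [ "$status" -lt 256 ]; then
    sig=$((status - 128))
    # kill -l N prints the name of signal number N; names are platform-dependent.
    name=$(kill -l "$sig" 2>/dev/null) || name=unknown
    printf 'signal %d (%s)\n' "$sig" "$name"
  else
    printf 'exit code %d\n' "$status"
  fi
}
```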
4 votes, 3 answers
How to cancel a job that is on completing (CG) state?
I submitted some jobs with sbatch as usual and later canceled some of them with scancel. However, they remain in the CG state and I cannot remove them from my list.
Is there any way to get rid of those CG jobs?
Sadly, I'm not the administrator of…
Iago Carvalho
4 votes, 1 answer
Slurm initialization fails in a Raspberry Pi cluster with Raspbian 9.4
I am trying to set up Slurm in a Raspberry Pi cluster with Raspbian 9.4.
I am able to start slurmctld, but when I try to launch slurmd I get the following output:
pi@node1:~ $ slurmd -Dvvvc
slurmd: debug: Log file re-opened
slurmd: error: Domain…
Bub Espinja
3 votes, 1 answer
How to make Slurm request only one core instead of a whole node or socket?
I wrote Perl scripts to analyze my simulation data. They are not concurrent programs. The cluster has eight nodes, and each node has 2 sockets with 10 cores each. I want to submit my job with Slurm and request only one core to…
Leon
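This is a minimal job-script sketch for a serial Perl job. Asking for one task with one CPU per task requests a single core. That only works if the cluster's select plugin allocates individual cores (select/cons_res or select/cons_tres). With select/linear, Slurm hands out whole nodes regardless, and changing that is an administrator setting.

The job name, the time limit, and analyze.pl are placeholders.

```bash
#!/bin/bash
# One task with one CPU per task requests a single core (names and limits are placeholders).
#SBATCH --job-name=perl-analysis
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; default to 1 outside Slurm.
echo "CPUs for this task: ${SLURM_CPUS_PER_TASK:-1}"

# perl analyze.pl   # hypothetical script name; replace with the real analysis command
```

Submit the script with sbatch. One copy per input file could be submitted in the same way, since each job needs only one core.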
2 votes, 1 answer
Slurm on AWS returns slurmstepd: error: execve(): : No such file or directory
I have installed a Burstable and Event-driven HPC Cluster on AWS Using Slurm according to this tutorial.
With this installation I can burst instances and run jobs in the Slurm environment on EC2. After running:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH…
2 votes, 0 answers
How to use SLURM's --dependency=expand: correctly
I have 1 slurm job unfinished out of 5 that's been running 19 hours and I'm concerned that it will hit walltime before it finishes. I'm not the admin and it's the weekend, so I would like to try using this feature I discovered recently shown in…
hepcat72
1 vote, 0 answers
Sorting, merging, and deduplicating large txt files on HPC?
I am looking for some advice.
I am currently creating a k-mer database and want to merge, sort, and keep the unique lines from 47 sample.txt.gz files, each 16 GB. What would be the fastest way to do this?
I am currently running this:
zcat…
user29589776
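One common pattern is sketched below, assuming GNU coreutils sort and gzip are available. First, sort and deduplicate each file on its own. Then combine the presorted pieces with `sort -m -u`, which merges them without re-sorting the whole input.

`merge_unique` is a hypothetical helper. Temporary files go wherever mktemp puts them, so on an HPC system point TMPDIR at a large scratch directory.

```bash
#!/usr/bin/env bash
# Sketch: build one sorted, deduplicated file from many gzipped text inputs.
# Usage: merge_unique OUTPUT INPUT.gz...
merge_unique() (
  # Subshell body: strict mode and the cleanup trap stay local to this function.
  set -euo pipefail
  out=$1; shift
  tmpdir=$(mktemp -d)
  trap 'rm -rf -- "$tmpdir"' EXIT
  i=0
  for f in "$@"; do
    # Sort and deduplicate each input separately; LC_ALL=C compares raw bytes,
    # which is much faster than locale-aware collation.
    gzip -dc "$f" | LC_ALL=C sort -u -T "$tmpdir" > "$tmpdir/part_$i.txt"
    i=$((i + 1))
  done
  # -m merges already-sorted inputs without a full re-sort;
  # -u keeps one copy of lines repeated across inputs.
  LC_ALL=C sort -m -u -T "$tmpdir" "$tmpdir"/part_*.txt > "$out"
)
```

The per-file sorts are independent of each other. They could therefore be spread across a Slurm job array, with only the final merge run once. GNU sort's -S (buffer size) and --parallel options are the usual tuning knobs for the per-file step.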
1 vote, 1 answer
Where is the slurmd binary on an HPC system?
I want to find the path of the slurmd binary on an HPC system. I ran which slurmd but got an error:
/usr/bin/which: no slurmd…
Martin
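The which command only searches your PATH. Daemons such as slurmd are commonly installed in an sbin directory (for example /usr/sbin), which is often not on an ordinary user's PATH. They may also be installed only on compute nodes, not on the login node.

The sketch below checks PATH first and then a list of explicit candidate directories. `find_binary` and the directories in the usage line are assumptions, not guaranteed locations at every site.

```bash
#!/usr/bin/env bash
# Sketch: print the path of executable NAME, searching PATH first, then DIR...
# Usage: find_binary NAME DIR...
find_binary() {
  local name=$1 dir
  shift
  if command -v "$name" >/dev/null 2>&1; then
    command -v "$name"
    return 0
  fi
  for dir in "$@"; do
    # -x: the file exists and is executable by the current user.
    if [ -x "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}
```

Usage: `find_binary slurmd /usr/sbin /usr/local/sbin /opt/slurm/sbin`.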
1 vote, 0 answers
How to make a host file in SLURM with $SLURM_JOB_NODELIST
I have access to an HPC system with 40 cores per node. I have a batch file that runs a total of 35 codes, each in a separate folder. Each code is an OpenMP code that requires 4 cores. So how do I allocate resources such that each code gets 4…
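This sketch assumes the launcher expects an OpenMPI-style hostfile with lines of the form `host slots=N`. Running `scontrol show hostnames "$SLURM_JOB_NODELIST"` expands the compressed node list into one hostname per line, and a small filter then adds the slot count. With 40 cores per node and 4 cores per code, 10 slots per node is the natural choice. `make_hostfile` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: turn one hostname per line on stdin into "host slots=N" lines.
# Usage: make_hostfile SLOTS < hostnames
make_hostfile() {
  local slots=$1 host
  # "|| [ -n "$host" ]" also handles a final line without a trailing newline.
  while IFS= read -r host || [ -n "$host" ]; do
    if [ -n "$host" ]; then
      printf '%s slots=%s\n' "$host" "$slots"
    fi
  done
}
```

Usage inside a job: `scontrol show hostnames "$SLURM_JOB_NODELIST" | make_hostfile 10 > hostfile`.

Each OpenMP code runs on a single node, so there is an alternative that avoids a hostfile altogether. Launch each code as its own step, for example `srun --ntasks=1 --cpus-per-task=4 ... &`, and follow the launches with `wait`.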
1 vote, 1 answer
SLURM setting nodes to drain due to low socket-core-thread-cpu count
I have SLURM set up with a couple of workstations. There are different kinds, but let's take one with a CPU which has 4 cores and no additional SMT, so 4 threads in total. lscpu shows me the following:
$ lscpu
Architecture: x86_64
CPU…
Martin Ueding
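Nodes are typically drained with this reason when the NodeName definition in slurm.conf claims more sockets, cores or threads than slurmd actually detects. The fragment below is a configuration sketch, not a tested fix: compare the two with slurmd -C and make them agree. The host name ws1 is hypothetical.

```bash
# 1. On the workstation, print the hardware layout slurmd itself detects:
slurmd -C

# 2. Make the node's line in slurm.conf match that output, e.g. for a
#    4-core CPU without SMT (ws1 is a hypothetical host name):
#    NodeName=ws1 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1

# 3. After restarting slurmctld and slurmd so they read the corrected slurm.conf,
#    return the drained node to service:
scontrol update NodeName=ws1 State=RESUME
```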
1 vote, 1 answer
slurmd: Invalid job credential
I'm having some problems with a test configuration of Slurm on my laptop. I'm trying to run four slurmd instances on one machine, which is also the machine slurmctld runs on. I have a local munged running as user munge. slurmd and slurmctld…
lukas
1 vote, 0 answers
Slurm - GPU enforcement with cgroups
I am running Slurm 19.05 on a single machine (Ubuntu 18.04) to schedule GPU tasks. However, I am having trouble setting up GPU enforcement with cgroups.
If I set ConstrainDevice=yes in my cgroup.conf file, TensorFlow is not able to access my…
Jonas
1 vote, 1 answer
Ubuntu 18.10 and modify installed package - OpenMPI
I've installed openmpi-bin (OpenMPI 3.1) on Ubuntu 18.10. I also run Slurm on the same machine and would like to recompile or reconfigure my OpenMPI installation to support Slurm.
If one installs OpenMPI from source, there is a setting…
Paer