I'm having some problems with a test configuration of Slurm on my laptop. I'm trying to run four slurmd instances on one machine, which is also the machine slurmctld runs on. A local munged is running as user munge. slurmd and slurmctld are running as my own user, which is also the user set in /etc/slurm-llnl/slurm.conf.
All slurmd instances connect to slurmctld, and I can use sbatch to start a simple job that echoes "Yay!" and exits without any problem. The problems arise when I use salloc and then run mpirun in the allocated shell. Every slurmd instance except the first one started prints the following error:
slurmd: error: Credential signature check: Credential replayed
slurmd: error: Invalid job credential from 1000@127.0.0.1: Invalid job credential
mpirun then fails with:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
I've read that this can be caused by the system time being off or by different UIDs being used on different machines, but in my case all processes run on the same machine.
Anybody got any ideas?