I'm having some problems with a test configuration of Slurm on my laptop. I'm trying to run four slurmd instances on one machine, which is also the machine slurmctld runs on. A local munged is running as user munge. slurmd and slurmctld run as my own user, which is also the user set in /etc/slurm-llnl/slurm.conf.

All slurmd instances connect to slurmctld, and I can use sbatch to start a simple job that echoes "Yay!" and exits without any problem. Problems arise when I run salloc and then mpirun in the allocated shell. I get the following error message from every slurmd instance except the first one started:

slurmd: error: Credential signature check: Credential replayed
slurmd: error: Invalid job credential from 1000@127.0.0.1: Invalid job credential

The mpirun will fail with:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I've read that this could be due to the system time being off or different uids used on different machines, but all processes are run on the same machine.

Anybody got any ideas?

lukas
  • 11

1 Answer

It turns out I forgot to configure Slurm with --enable-multiple-slurmd. My best guess at what happened is that all four slurmd daemons tried to validate the same job credential against the same munged running on localhost. This caused munged to reject all but the first validation request with "Credential replayed". The --enable-multiple-slurmd switch in Slurm seems to enable some mechanism that prevents these repeated validation requests to munged.
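
For anyone trying the same setup, here is a rough sketch of what the relevant part of slurm.conf might look like for four slurmd instances on one host, assuming Slurm was built with --enable-multiple-slurmd. The node names, port numbers, partition name and /tmp paths are made up for illustration and are not taken from my actual configuration; each daemon needs its own port, and the %n placeholder keeps the per-daemon files apart.

```
# slurm.conf fragment (hypothetical names, ports and paths)
# Four virtual nodes that all resolve to the local machine,
# each listening on its own port.
NodeName=node[1-4] NodeHostname=localhost Port=[17001-17004]
PartitionName=debug Nodes=node[1-4] Default=YES State=UP

# %n expands to the node name, so the daemons do not share files.
SlurmdPidFile=/tmp/slurmd.%n.pid
SlurmdSpoolDir=/tmp/slurmd.%n
SlurmdLogFile=/tmp/slurmd.%n.log
```

With a layout like this, each daemon would be started under its own node name, for example slurmd -N node1, slurmd -N node2, and so on.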

lukas
  • 11