[Beowulf] Puzzling Intel mpi behavior with slurm
cap at nsc.liu.se
Wed Apr 11 07:38:31 PDT 2018
On Thu, 05 Apr 2018 09:10:57 -0600
Faraz Hussain <info at feacluster.com> wrote:
> Here's something quite baffling. I have a cluster running slurm but
> have not set up passwordless ssh for a user yet. So when the user
> runs "mpirun -n 2 -hostfile hosts hostname", it will hang because of
> the ssh issue. That is expected.
> Now the baffling thing is the mpirun command works inside a slurm
> script! How can it work if passwordless ssh has not been configured?
> Does slurm use some different authentication (munge?) to login to
> the hosts and execute the hostname command?
What happens is that mpirun sees the slurm environment variables and
switches to a slurm aware mode.
In this mode it uses srun to launch pmi_proxy processes on each node
of the job. Then it proceeds to start all ranks using these pmi_proxy
processes.
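As a rough illustration of that switch, here is a minimal Python sketch of
the decision. It is not Intel MPI's actual detection code: the function name
guess_bootstrap is made up for this example, and the assumption is simply
that the presence of SLURM_JOB_ID (or the older SLURM_JOBID) marks a slurm
job, and that an explicit I_MPI_HYDRA_BOOTSTRAP setting takes precedence.

```python
import os


def guess_bootstrap(env):
    """Guess which launcher mpirun would use, given an environment mapping.

    Hypothetical helper for illustration only; the real logic lives inside
    Intel MPI's Hydra process manager and may check other variables.
    """
    # An explicit setting wins (assumed precedence).
    explicit = env.get("I_MPI_HYDRA_BOOTSTRAP")
    if explicit:
        return explicit
    # Inside a slurm job these variables are set by slurm itself.
    if "SLURM_JOB_ID" in env or "SLURM_JOBID" in env:
        return "slurm"
    # Outside of a job, fall back to ssh, which needs passwordless login.
    return "ssh"


if __name__ == "__main__":
    print(guess_bootstrap(os.environ))
```

Run from a login shell this would normally report ssh, and from inside a
job script it would report slurm, which matches the behavior described in
the question.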
The process tree ends up being something like this on the first node:
slurmd->slurmstepd->bash(jobscript)->mpirun->srun -w nodes[..] pmi_proxy
And on the other nodes:
Authentication/authorization is handled by slurm and depends on how you
set it up (often munge).
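
To see which mechanism a given cluster uses, one option is to look at the
AuthType line in the output of "scontrol show config". The sketch below only
parses such text; the function name auth_type and the sample string are made
up for this example, and the sample assumes the usual "Key = value" layout
of that output.

```python
def auth_type(config_text):
    """Return the AuthType value from 'scontrol show config' style text.

    Hypothetical helper for illustration; returns None if no AuthType
    line is present.
    """
    for line in config_text.splitlines():
        # Each setting is assumed to be printed as "Key = value".
        key, sep, value = line.partition("=")
        if sep and key.strip() == "AuthType":
            return value.strip()
    return None


if __name__ == "__main__":
    # Illustrative sample, not captured from a real cluster.
    sample = "AuthInfo                = (null)\nAuthType                = auth/munge\n"
    print(auth_type(sample))
```

On a real cluster the text could come from
subprocess.run(["scontrol", "show", "config"], capture_output=True, text=True).stdout
instead of the sample string.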