[Beowulf] slow mpi init/finalize

Christopher Samuel samuel at unimelb.edu.au
Sun Oct 15 16:08:49 PDT 2017


On 12/10/17 01:12, Michael Di Domenico wrote:

> I'm seeing issues on a Mellanox FDR10 cluster where the MPI setup
> and teardown take longer than I'd expect on larger rank-count jobs.
> I'm only running ~1000 ranks and the startup time is over a minute.
> I tested this with both Open MPI and Intel MPI; both exhibit close
> to the same behavior.
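Before anything else, it's worth confirming the time really is going
into MPI_Init/MPI_Finalize rather than into the application itself.
A rough, untested sketch of a timing harness (plain C, built with
mpicc; it uses a plain wall-clock helper because that is safe to
call before MPI_Init):

    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    /* Wall-clock timer that is safe to call before MPI_Init(). */
    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(int argc, char **argv)
    {
        double t0 = now();
        MPI_Init(&argc, &argv);
        double t1 = now();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);  /* get all ranks lined up */

        double t2 = now();
        MPI_Finalize();
        double t3 = now();

        if (rank == 0)
            printf("init %.2fs, finalize %.2fs\n", t1 - t0, t3 - t2);
        return 0;
    }

Run that at the same rank count; if the minute shows up there with an
otherwise empty application, the launcher and wire-up are the place
to look.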

What wire-up protocol are you using for your MPI in your batch system?

With Slurm, at least, you should be looking at PMIx or PMI2 (PMIx
requires Slurm to be built against it as an external library, while
PMI2 ships as a contrib plugin in the Slurm source tree).
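
With reasonably recent Slurm you can see which wire-up plugins your
build supports with "srun --mpi=list" and select one per job with
"srun --mpi=pmix" (or "--mpi=pmi2"), or make it the cluster-wide
default via MpiDefault in slurm.conf. Note that Open MPI also has to
be built with matching support (for PMIx, the --with-pmix configure
option).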

Hope that helps.
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545


