[Beowulf] slow mpi init/finalize
Christopher Samuel
samuel at unimelb.edu.au
Sun Oct 15 16:08:49 PDT 2017
On 12/10/17 01:12, Michael Di Domenico wrote:
> I'm seeing issues on a Mellanox FDR10 cluster where the MPI setup and
> teardown take longer than I expect they should on larger rank count
> jobs. I'm only trying to run ~1000 ranks and the startup time is over
> a minute. I tested this with both Open MPI and Intel MPI; both exhibit
> close to the same behavior.
What wire-up protocol are you using for your MPI in your batch system?
With Slurm, at least, you should be looking at PMIx or PMI2 (PMIx
needs Slurm to be compiled against it as an external library; PMI2 is
a contrib plugin in the Slurm source tree).
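For example, with a recent Slurm built with PMIx support (a sketch
only; the exact plugin name varies by Slurm version, and
./my_mpi_app is a placeholder for your binary):

  # Show which wire-up plugins this Slurm build supports
  srun --mpi=list

  # Launch using PMIx (or pmi2, if that is what is available)
  srun --mpi=pmix -n 1024 ./my_mpi_app

  # Or set a cluster-wide default in slurm.conf
  MpiDefault=pmix

PMIx was designed in part to make wire-up scale better, so it should
help with exactly this sort of slow startup on large rank counts.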
Hope that helps.
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545