[Beowulf] slow mpi init/finalize

Michael Di Domenico mdidomenico4 at gmail.com
Tue Oct 17 07:59:41 PDT 2017


On Tue, Oct 17, 2017 at 8:54 AM, Peter Kjellström <cap at nsc.liu.se> wrote:
> If you have IntelMPI also try what I suggested and use the ucm dapl.
> For example for the first port on an mlx4 hca that's "ofa-v2-mlx4_0-1u".
>
> You can make sure that it comes first in your dat.conf (/etc/rdma
> or /etc/infiniband) or pass it explicitly to IntelMPI:
>
> I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...
>
> You may want to set I_MPI_DEBUG=4 or so to see what it does.

i can confirm that the dapl test with intelmpi is pretty speedy.

when i start up an mpi job without dapl enabled it takes ~60 seconds
before the test actually starts; with dapl enabled it's only a few
seconds.  the t_avg timings in the imb alltoallv test i'm running are
also vastly different.
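
for anyone who wants to reproduce the comparison, here is a rough sketch
of how the two launches could be timed side by side.  the time_launch
helper is a made-up name, and the rank count and ./IMB-MPI1 path in the
usage comments are assumptions, not my actual job line.  it measures the
whole run in whole seconds, so a trivial init/finalize program would
isolate the startup cost better than a full benchmark.

```shell
#!/bin/sh
# time_launch (hypothetical helper): run the given command, discard its
# output, and print the elapsed wall-clock time in whole seconds.
# the command's exit status is not checked.
time_launch() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# example usage, commented out.  the rank count and the ./IMB-MPI1 path
# are assumptions; substitute your own launch line.
#
#   time_launch mpiexec.hydra -n 2 ./IMB-MPI1 Alltoallv
#   time_launch env I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u \
#       mpiexec.hydra -n 2 ./IMB-MPI1 Alltoallv
```

with a one-second clock this only shows gross differences like the
~60 seconds versus a few seconds above, which is all that's needed here.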

i think i can safely say at this point it's probably not hardware
related, but that something went wonky with openmpi.  i downloaded the
newly released version 3 and i'll see if that fixes anything.  i've
been tracking reports on the openmpi list about issues between slurm
and openmpi relating to pmi.  i'm not sure whether they're related to
this, but they might be.

