[Beowulf] slow mpi init/finalize

Peter Kjellström cap at nsc.liu.se
Tue Oct 17 05:54:14 PDT 2017


On Mon, 16 Oct 2017 13:11:37 -0400
Michael Di Domenico <mdidomenico4 at gmail.com> wrote:

> On Mon, Oct 16, 2017 at 7:16 AM, Peter Kjellström <cap at nsc.liu.se>
> wrote:
> > Another is that your MPIs tried to use rdmacm and that in turn
> > tried to use ibacm which, if incorrectly setup, times out after
> > ~1m. You can verify ibacm functionality by running for example:
> >
> > user@n1 $ ib_acme -d n2
> > ...
> > user@n1 $
> >
> > This should be near instant if ibacm works as it should.  
> 
> i didn't specifically tell mpi to use one connection setup vs another,
> but i'll see if i can track down what openmpi is doing in that regard.
> 
> however, your test above fails on my machines
> 
> user@n1# ib_acme -d n3
> service: localhost
> destination: n3
> ib_acm_resolve_ip failed: cannot assign requested address
> return status 0x0

Did this fail instantly or only after the typical ~1 minute timeout?
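
(Regarding what Open MPI does: with the openib btl the connection
setup method is selected via the btl_openib_cpc_include/exclude MCA
parameters. As a rough sketch, assuming a reasonably recent Open MPI,
you can list them and keep rdmacm out of the picture like this:

user@n1 $ ompi_info --param btl openib --level 9 | grep cpc
user@n1 $ mpirun --mca btl_openib_cpc_exclude rdmacm ./your_app

Parameter names vary a bit between Open MPI versions, so treat this
as a pointer rather than a recipe.)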
 
> in the /etc/rdma/ibacm_addr.cfg file it just lists the data specific
> to each host, which is gathered by ib_acme -A

Often you don't need ibacm running at all, and if you stop it this
specific problem goes away (i.e. nothing can ask ibacm for lookups
and then hang waiting for the timeout). The service is typically
/etc/init.d/ibacm. Once ibacm is stopped, anything that uses
librdmacm for lookups will instead send a direct query to the SA (the
subnet administrator, part of the subnet manager). On a larger
cluster, and for certain use cases, that can quickly become too much
load (which is why the caching that ibacm provides exists in the
first place).
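
For example, as a sketch (the exact service name and the init system
in use depend on your distribution):

root@n1 # /etc/init.d/ibacm stop
root@n1 # chkconfig ibacm off

or, on a systemd based system:

root@n1 # systemctl stop ibacm
root@n1 # systemctl disable ibacm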

If you have IntelMPI, also try what I suggested earlier and use the
ucm dapl provider. For example, for the first port on an mlx4 HCA
that's "ofa-v2-mlx4_0-1u".

You can either make sure that it comes first in your dat.conf (in
/etc/rdma or /etc/infiniband) or pass it explicitly to IntelMPI:

I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...

You may want to set I_MPI_DEBUG=4 or so to see what it does.
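
If you go the dat.conf route instead, the ucm entry typically looks
something like the line below. This is only a sketch; the exact
library name and version in your dat.conf may differ, so check what
your file actually contains (e.g. "grep ucm /etc/rdma/dat.conf"):

ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""

Moving that entry above the other ofa-v2-* lines makes it the one
IntelMPI picks up by default.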

/Peter K

