[Beowulf] mpi slow pairs

Tue Sep 2 08:16:27 PDT 2014

On 08/29/2014 11:30 AM, Michael Di Domenico wrote:
> On Fri, Aug 29, 2014 at 9:32 AM, John Hearns <John.Hearns at viglen.co.uk> wrote:
>> I would say the usual tool for that pair-wise comparison is Intel IMB
>> https://software.intel.com/en-us/articles/intel-mpi-benchmarks
>> I hope I have got your requirement correct!
> John,
>
> Close, but not exact.  IMB will test ranks, but will not tell me if a
> specific pair of ranks is slower than others, only the collective of
> the ranks under test.  what i'm looking for is an mpi version of this
>
> for x in node{1..100}; do
>   for y in node{1..100}; do
>     [ "$x" = "$y" ] && continue   # skip self-pairs
>     mpirun -n 2 -npernode 1 -host "$x,$y" bwtest > "$x$y.log"
>   done
> done
>
> unfortunately, the mpirun task takes about 3secs per iteration, and
> with 10k iterations, it's going to take a long time and i'm being
> impatient.  i've been trying to write the mpi code myself, but my mpi
> is a little rusty so it's slow going...
>
>> Also have you run  ibdiagnet to see if anything is flagged up?
> i've run a multitude of ib diags on the machines, but nothing is
> popping out as wrong.  what's weird is that it's only certain pairings
> of machines, not any one machine in general.
>

I find most of the ibdiag* utilities to be of limited value when 
debugging IB issues. Unfortunately, Mellanox's Unified Fabric Manager 
(UFM) seems to be the only tool that's helpful for accurately monitoring 
and identifying issues with IB networks. I've never used UFM myself, but 
my friends at Princeton gave me a demo, and it seems like a fantastic 
tool.

Unfortunately, it's a commercial product, and probably only works on 
Mellanox hardware (you don't mention whether you're using QLogic or 
Mellanox hardware). The good news is, you can download it and evaluate 
it. I'd give that a try, if I were you.

http://www.mellanox.com/page/products_dyn?product_family=100
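
As for the all-pairs loop quoted above: one way to cut the wall-clock
time without writing any MPI code might be to run disjoint pairs at the
same time. Below is a minimal sketch, not a tested tool, that uses the
round-robin "circle method" to split 100 nodes into 99 rounds of 50
non-overlapping pairs. The function names and the node1..node100 host
names are placeholders of mine, and the mpirun line is simply copied
from the quoted loop. Note that it covers each unordered pair once; if
direction matters, each pair would need a second run with the hosts
swapped.

```python
def round_robin_rounds(hosts):
    """Split hosts into rounds of disjoint pairs (circle method).

    Every unordered pair of hosts appears in exactly one round, and no
    host appears twice within a round.
    """
    hosts = list(hosts)
    if len(hosts) % 2:
        hosts.append(None)  # placeholder: its partner sits that round out
    n = len(hosts)
    rounds = []
    for _ in range(n - 1):
        # pair the i-th host from the front with the i-th from the back
        pairs = [(hosts[i], hosts[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # keep hosts[0] fixed and rotate the rest by one position
        hosts = [hosts[0], hosts[-1]] + hosts[1:-1]
    return rounds


def commands_for_round(pairs):
    """Build one mpirun command line per pair (command taken from the thread)."""
    return [
        "mpirun -n 2 -npernode 1 -host %s,%s bwtest > %s-%s.log" % (a, b, a, b)
        for a, b in pairs
    ]


# Hypothetical host names, matching the node1..node100 range in the loop.
nodes = ["node%d" % i for i in range(1, 101)]
schedule = round_robin_rounds(nodes)
```

Each round's commands could be started in the background and waited on
before moving to the next round. Pairs that run concurrently share
switch links, so a slow result may reflect contention rather than a bad
path; I'd re-run any suspicious pair on its own before trusting it.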

--
Prentice