[Beowulf] mpi slow pairs
Prentice Bisbal
prentice.bisbal at rutgers.edu
Tue Sep 2 08:16:27 PDT 2014
On 08/29/2014 11:30 AM, Michael Di Domenico wrote:
> On Fri, Aug 29, 2014 at 9:32 AM, John Hearns <John.Hearns at viglen.co.uk> wrote:
>> I would say the usual tool for that pair-wise comparison is Intel IMB
>> https://software.intel.com/en-us/articles/intel-mpi-benchmarks
>> I hope I have got your requirement correct!
> John,
>
> Close, but not exact. IMB will test ranks, but will not tell me if a
> specific pair of ranks is slower than others, only the collective of
> the ranks under test. what i'm looking for is an mpi version of this
>
> # one 2-rank bandwidth run for every ordered pair of distinct nodes
> for x in node{1..100}; do
>   for y in node{1..100}; do
>     [ "$x" = "$y" ] && continue
>     mpirun -n 2 -npernode 1 -host "$x,$y" bwtest > "$x$y.log"
>   done
> done
>
> unfortunately, the mpirun task takes about 3 secs per iteration, and
> with 10k iterations, it's going to take a long time and i'm being
> impatient. i've been trying to write the mpi code myself, but my mpi
> is a little rusty so it's slow going...
>
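
One way to cut the launch overhead, offered only as a rough sketch and not
as something anyone in this thread has tested: instead of one mpirun per
pair, schedule the pairs in rounds so that every node is busy in every
round. The "circle method" used for round-robin tournaments gives 99 rounds
of 50 disjoint pairs for 100 nodes. The function name round_robin_rounds and
the node1..node100 host names are assumptions made for illustration; the
schedule covers unordered pairs, so each pair would need to be measured in
both directions if direction matters.

```python
def round_robin_rounds(hosts):
    """Circle method: return a list of rounds, each a list of disjoint host pairs."""
    ring = list(hosts)
    if len(ring) % 2:
        ring.append(None)  # odd host count: one host sits out each round
    n = len(ring)
    rounds = []
    for _ in range(n - 1):
        # pair the i-th host from the front with the i-th host from the back
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        rounds.append([(a, b) for a, b in pairs
                       if a is not None and b is not None])
        # keep ring[0] fixed and rotate the remaining hosts by one position
        ring = [ring[0], ring[-1]] + ring[1:-1]
    return rounds


# hypothetical host names, matching the node1..node100 loop quoted above
hosts = [f"node{i}" for i in range(1, 101)]
rounds = round_robin_rounds(hosts)
print(len(rounds), "rounds,", sum(len(r) for r in rounds), "unordered pairs")
```

Each round could then drive 50 simultaneous two-rank tests, either inside a
single MPI job or as 50 background mpirun launches. One caveat: pairs running
at the same time may share fabric links, so a pair that looks slow in a
round is probably worth re-testing on its own before blaming it.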
>> Also have you run ibdiagnet to see if anything is flagged up?
> i've run a multitude of ib diags on the machines, but nothing is
> popping out as wrong. what's weird is that it's only certain pairings
> of machines, not any one machine in general.
>
I find most of the ibdiag* utilities to be of limited value when
debugging IB issues. Unfortunately, Mellanox's Unified Fabric Manager
(UFM) seems to be the only tool that's helpful for accurately monitoring
and identifying issues with IB networks. I've never used UFM myself, but
my friends at Princeton gave me a demo, and it seems like a fantastic
tool.
Unfortunately, it's a commercial product, and probably only works on
Mellanox hardware (you don't mention whether you're using QLogic or
Mellanox hardware). The good news is, you can download it and evaluate
it. I'd give that a try, if I were you.
http://www.mellanox.com/page/products_dyn?product_family=100
--
Prentice