[Beowulf] MPI + IB question

Christopher Samuel samuel at unimelb.edu.au
Sun Nov 18 18:59:59 PST 2012


On 15/11/12 22:02, Bogdan Costescu wrote:

> This is not really a crash... it actually tells you politely that 
> it couldn't reach other ranks and terminates. The following lines:
> 
> Process 1 ([[5187,1],1]) is on host: node24 Process 2 
> ([[5187,1],0]) is on host: node32 BTLs attempted: self sm
> 
> mean that the only BTLs qualified to continue were self and sm, 
> neither of which allows inter-node communication. Very likely tcp 
> (which you disabled) was the only inter-node BTL available. So now 
> it's up to you to find out why openib BTL could not be selected...
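
To spell that reasoning out, here is a minimal Python sketch that pulls the
"BTLs attempted" list out of the error text and checks whether any of them
can cross nodes. The function names, the INTER_NODE_BTLS set and the line
layout of the sample message are my assumptions for illustration, not
anything Open MPI provides.

```python
import re

# Assumption: "self" (loopback) and "sm" (shared memory) are node-local,
# while "tcp" and "openib" can carry traffic between nodes.
INTER_NODE_BTLS = {"tcp", "openib"}

# Sample text based on the error quoted above (line breaks are assumed).
MESSAGE = """\
Process 1 ([[5187,1],1]) is on host: node24
Process 2 ([[5187,1],0]) is on host: node32
BTLs attempted: self sm
"""


def attempted_btls(error_text):
    """Return the BTL names listed after 'BTLs attempted:'."""
    match = re.search(r"BTLs attempted:\s*(.*)", error_text)
    return match.group(1).split() if match else []


def can_reach_other_nodes(btls):
    """True if at least one attempted BTL supports inter-node traffic."""
    return any(name in INTER_NODE_BTLS for name in btls)


btls = attempted_btls(MESSAGE)
print(btls, "-> inter-node capable:", can_reach_other_nodes(btls))
```

With only self and sm in the list, the two ranks on node24 and node32 have
no transport in common, which is why the job terminates.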

As Bogdan says, you really need to investigate the IB on those two
nodes (node24 and node32) to see whether it is working or not.

Running ibstatus is probably a good start, to check that the card is
happily talking to the fabric, e.g.:

[root@merri001 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0007:3d51
        base lid:        0x5c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
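
If you need to check more than a couple of nodes, something like the
following Python sketch can parse that output and flag ports that are not
both ACTIVE and LinkUp. It assumes the output layout shown above; the
function names are hypothetical, and the sample text is just the output
from my node.

```python
import re

# Sample text copied from the ibstatus output above.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0007:3d51
        base lid:        0x5c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
"""

HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")


def parse_ibstatus(text):
    """Return one dict per port, keyed by the field names ibstatus prints."""
    ports = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = HEADER.match(stripped)
        if header:
            current = {"device": header.group(1), "port": int(header.group(2))}
            ports.append(current)
        elif current is not None and ":" in stripped:
            # Split on the first colon only, since values contain colons too.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return ports


def port_is_healthy(port):
    """A port is considered healthy when it is ACTIVE and the link is up."""
    return (port.get("state", "").endswith("ACTIVE")
            and port.get("phys state", "").endswith("LinkUp"))


for port in parse_ibstatus(SAMPLE):
    status = "ok" if port_is_healthy(port) else "CHECK"
    print(port["device"], port["port"], status)
```

To run it against a live node you could feed it the text captured from
ibstatus itself (for example via subprocess.run) instead of SAMPLE.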


There's also ibstat, which gives you somewhat more verbose info.

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
