[Beowulf] Infiniband PortXmitWait problems on IBM Sandybridge iDataplex with Mellanox ConnectX-3

Christopher Samuel samuel at unimelb.edu.au
Tue Jun 11 22:03:11 PDT 2013


Hi folks,

I'm doing the bring up and testing on our SandyBridge IBM iDataplex
with an FDR switch and as part of that I've been doing burn-in testing
with HPL and seeing really poor efficiency (~25% across 65-odd nodes
with 256GB RAM).  Meanwhile, HPL on the 3 nodes with 512GB RAM
gives ~70% efficiency.

Checking the switch with ibqueryerrors shows lots of things like:

   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]

That's about 2 or 3 hours after last clearing the counters. :-(
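As a back-of-envelope check (assuming the counter was cleared roughly
2.5 hours earlier, splitting the difference), that counter value works
out to a sustained wait rate in the tens of thousands per second:

```python
# Rough rate estimate; the 2.5-hour window is an assumption based on
# "2 or 3 hours" since the counters were last cleared.
# PortXmitWait counts ticks in which the port had data queued but
# could not transmit (e.g. no flow-control credits from the far end).
port_xmit_wait = 198_817_026
elapsed_s = 2.5 * 3600
rate = port_xmit_wait / elapsed_s
print(f"~{rate:.0f} waits/sec sustained")
```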


# ibclearcounters && ibclearerrors && sleep 1 && ibqueryerrors

Shows 75 of 94 nodes bad, pretty much all with thousands of
PortXmitWait, some into the tens of thousands.
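For anyone wanting to triage this sort of output, here's a minimal
Python sketch (assuming ibqueryerrors lines of the form shown above;
the sample data is made up for illustration) that pulls out the
PortXmitWait values and lists the worst ports first:

```python
import re

# Match ibqueryerrors-style lines such as:
#   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
LINE_RE = re.compile(r"GUID (0x[0-9a-f]+) port (\d+): \[PortXmitWait == (\d+)\]")

def worst_ports(report, threshold=1000):
    """Return (guid, port, count) tuples at/above threshold, worst first."""
    hits = []
    for line in report.splitlines():
        m = LINE_RE.search(line)
        if m:
            guid, port, count = m.group(1), int(m.group(2)), int(m.group(3))
            if count >= threshold:
                hits.append((guid, port, count))
    return sorted(hits, key=lambda t: t[2], reverse=True)

# Hypothetical sample output, for illustration only:
sample = """\
GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
GUID 0x2c90300771451 port 3: [PortXmitWait == 512]
GUID 0x2c90300771452 port 7: [PortXmitWait == 45210]
"""
for guid, port, count in worst_ports(sample):
    print(guid, port, count)
```

In practice you'd feed it the live output, e.g.
`ibqueryerrors | ./worst_xmitwait.py`, and tune the threshold to taste.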

We are running RHEL 6.3, Mellanox OFED 2.0.5, FDR IB and Open-MPI 1.6.4.

Talking with another site that has the same sort of iDataplex, but
running RHEL 5.8, Mellanox OFED 1.5 and QDR IB, reveals that (once
they started looking) they are also seeing high PortXmitWait counters
shortly after clearing them, under user codes.

These are Mellanox MT27500 ConnectX-3 adapters.

We're talking with both IBM and Mellanox directly, but other than
Mellanox spotting some GPFS NSD file servers with bad FDR ports
(which were unplugged last week and fixed today) we've made no
progress on the underlying cause. :-(

Has anyone seen anything like this before?

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci


