[Beowulf] IPoIB failure

Fri Jan 23 05:39:34 PST 2015

We had a strange event last night.  Our IB fabric started demonstrating 
some odd routing behavior over IB.

host A could ping both B and C, yet B and C could not ping one another. 
  This was only at the IP layer.  ibping tests all worked fine.  A few 
runs of ibdiagnet produced all the switches and hosts we expected to find.

As we rebooted hosts with non-connectivity, they came up find but then 
host A could reach neither one.  After a number of host reboots we soon 
realized that we were playing whack-a-mole as the problem resurfaced 
sometimes on the original hosts and sometimes on a new host.

In the end we rebooted every Mellanox switch.  The big core switch.  The 
half rack switch.  The top of rack switches.  And sure enough, 
everything came back fine without any more reboots.

At this point all I know is that the server running our master opensm 
rebooted and it took a few hours before these problems started, first 
indicated by stale filesystem errors across the GPFS mounts.

Obviously, rebooting every dang switch is not the correct answer here. 
But at this point I don't have a better solution if it occurs again.  Or 
even an answer as to WHY it happened in the first place.  It just seems 
that the IPoIB layer was at fault here somehow in that routing was not 
correct across the entire IB network.

If anyone has any insights, I'd be most appreciative.  It's clear we do 
not understand this aspect of the IB stack and how this layer works.

Thanks,
Bill