[Beowulf] IPoIB failure
Lennart.Karlsson at it.uu.se
Fri Jan 23 06:29:36 PST 2015
On 01/23/2015 02:39 PM, Bill Wichser wrote:
> We had a strange event last night. Our IB fabric started demonstrating some odd routing behavior over IB.
> Host A could ping both B and C, yet B and C could not ping one another. This was only at the IP layer. ibping tests all worked fine. A few runs of ibdiagnet produced all the switches and hosts we expected to find.
> As we rebooted hosts with non-connectivity, they came up fine but then host A could reach neither one. After a number of host reboots we soon realized that we were playing whack-a-mole as the problem resurfaced sometimes on the original hosts and sometimes on a new host.
> In the end we rebooted every Mellanox switch. The big core switch. The half rack switch. The top of rack switches. And sure enough, everything came back fine without any more reboots.
> At this point all I know is that the server running our master opensm rebooted and it took a few hours before these problems started, first indicated by stale filesystem errors across the GPFS mounts.
> Obviously, rebooting every dang switch is not the correct answer here. But at this point I don't have a better solution if it occurs again. Or even an answer as to WHY it happened in the first place. It just seems that the IPoIB layer was at fault here somehow in that routing was not correct across the entire IB network.
> If anyone has any insights, I'd be most appreciative. It's clear we do not understand this aspect of the IB stack and how this layer works.
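As an aside, if it happens again, the symptom you describe (A reaches B and C, but B and C cannot reach each other) can be flagged automatically once ping results have been collected. Below is a minimal sketch, assuming the results are already gathered into a dictionary that maps each host to the set of hosts it can ping. The function name and the input format are hypothetical and do not come from any IB tool.

```python
from itertools import combinations

def broken_pairs(reach):
    """Return (b, c, witnesses) for each pair of hosts that cannot ping
    each other in either direction although other hosts can ping both.

    reach maps each host name to the set of hosts it can ping
    (hypothetical input format; fill it from your own ping runs).
    """
    hosts = sorted(reach)
    found = []
    for b, c in combinations(hosts, 2):
        # Skip pairs that still have connectivity in at least one direction.
        if c in reach[b] or b in reach[c]:
            continue
        # Hosts that can still reach both b and c.
        witnesses = [a for a in hosts if a not in (b, c) and {b, c} <= reach[a]]
        if witnesses:
            found.append((b, c, witnesses))
    return found

# Example mirroring the report: A pings B and C, while B and C only ping A.
example = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
print(broken_pairs(example))
```

This only reports the IP-level symptom. It says nothing about the cause, which in your case seemed to sit below the hosts.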
This reminds me of when we upgraded to SL-6.6 (approximately the same as CentOS-6.6 and RHEL-6.6).
The new kernel we got could not handle our IPoIB storage traffic, which broke down within
a few hours.
As far as I have heard, Red Hat is trying to fix this. Here is a link to a message indicating this, which
I got from NSC in Linköping:
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden