[Beowulf] IPoIB failure
Peter Kjellström
cap at nsc.liu.se
Wed Jan 28 02:51:16 PST 2015
On Wed, 28 Jan 2015 09:24:39 +1100
Christopher Samuel <samuel at unimelb.edu.au> wrote:
> On 24/01/15 01:29, Lennart Karlsson wrote:
>
> > This reminds me of when we upgraded to SL-6.6 (approximately the
> > same as CentOS-6.6 and RHEL-6.6).
> >
> > The new kernel we got could not handle our IPoIB storage
> > traffic, which broke down within a few hours.
>
> Interesting, we use GPFS over IPoIB and upgraded to RHEL 6.6 in early
> November and haven't seen any issues at all (and with a lot of
> bioinformatics users we'd notice problems pretty quickly).
Red Hat has confirmed that there are multiple issues with IPoIB in
RHEL 6.6, and there is a thread for testing fixes at:
[PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html
The problem is most easily demonstrated by restarting the subnet
manager (SM) and then bringing up new IPoIB interfaces on 6.6 hosts.
This creates islands of connectivity.
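To make "islands" concrete, here is a rough sketch of how one might group hosts after collecting pairwise reachability results (for example from ping tests over the IPoIB addresses). The function name, the host names and the input format are hypothetical, and it assumes reachability is symmetric and that hosts linked through a chain of working pairs belong to the same island, which may not match exactly how the multicast join failure partitions a real fabric.

```python
from collections import defaultdict


def connectivity_islands(hosts, reachable_pairs):
    """Group hosts into islands based on observed reachable pairs.

    hosts: iterable of host names (hypothetical example data).
    reachable_pairs: iterable of (a, b) tuples where a and b could
    reach each other; treated as symmetric.
    """
    neighbours = defaultdict(set)
    for a, b in reachable_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    islands = []
    seen = set()
    for host in hosts:
        if host in seen:
            continue
        # Depth-first walk collecting everything linked to this host.
        island = set()
        stack = [host]
        while stack:
            current = stack.pop()
            if current in island:
                continue
            island.add(current)
            stack.extend(neighbours[current] - island)
        seen |= island
        islands.append(sorted(island))
    return islands


if __name__ == "__main__":
    hosts = ["n1", "n2", "n3", "n4"]
    pairs = [("n1", "n2"), ("n3", "n4")]
    print(connectivity_islands(hosts, pairs))
    # prints [['n1', 'n2'], ['n3', 'n4']]
```

A healthy fabric should come back as a single island; more than one suggests the kind of partition described above.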
We are currently running a 6.6 kernel with the entire ulp/ipoib
directory reverted to 6.5.
/Peter
> Is your IB running in connected mode or datagram mode?
>
> We're in connected mode everywhere because of our BG/Q.
>
> All the best,
> Chris
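On the connected versus datagram question quoted above: the Linux IPoIB driver exposes the current mode of each interface in sysfs, so it can be checked per host. A minimal sketch, assuming the interface is named ib0 (the name will differ on many systems) and using a hypothetical helper function:

```python
from pathlib import Path


def ipoib_mode(interface="ib0", sysfs_root="/sys/class/net"):
    """Return the IPoIB mode of an interface ("connected" or "datagram").

    Returns None if the mode file cannot be read, for instance when
    the interface does not exist or is not an IPoIB interface.
    The default interface name "ib0" is an assumption.
    """
    mode_file = Path(sysfs_root) / interface / "mode"
    try:
        return mode_file.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print(ipoib_mode("ib0"))
```

The sysfs_root parameter is only there so the function can be pointed at a test directory; on a real host the default should be what you want.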