[Beowulf] IPoIB arp's disappearing

Michael Di Domenico mdidomenico4 at gmail.com
Thu Jul 10 03:36:14 PDT 2008

I'm having a weird problem that I cannot figure out.  If anyone in the
community can help, it would be appreciated.
Here's the packet flow


cn = compute node
io = io node
pan = panasas storage network

We have 12 shelves of Panasas network storage on a separate network,
fronted by bridge servers (the io nodes) that route traffic between IPoIB
and 10G ethernet.  We're using Mellanox Connect-X Ethernet/IB adapters
everywhere.  We're running OFED 1.3.1 and the latest firmware for IB/Eth.

Here's the problem.  I can mount the storage on the compute nodes, but if I
try to send anything more than 50MB of data via dd, I seem to lose the ARP
entries for the compute nodes on the IO servers.  This happens whether I
use the filesystem or a netperf run from the compute node to the Panasas
storage.
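
For anyone who wants to check for the same thing, here is a minimal sketch of
how the ARP table on an io node could be polled during a transfer.  It
assumes Linux, and that the IPoIB neighbours show up in /proc/net/arp; the
function names and the polling interval are made up for illustration and are
not taken from any existing tool.

```python
import time

ARP_PATH = "/proc/net/arp"  # Linux-specific location of the kernel ARP table
ATF_COM = 0x2               # "completed entry" flag as shown in /proc/net/arp


def parse_arp_table(text):
    """Return {ip: (flags, hw_address, device)} from /proc/net/arp contents."""
    entries = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, hw_addr, _mask, device = fields[:6]
        entries[ip] = (int(flags, 16), hw_addr, device)
    return entries


def lost_entries(before, after):
    """IPs that were resolved before but are now missing or incomplete."""
    lost = []
    for ip, (flags, _hw, _dev) in before.items():
        was_complete = bool(flags & ATF_COM)
        now = after.get(ip)
        still_complete = now is not None and bool(now[0] & ATF_COM)
        if was_complete and not still_complete:
            lost.append(ip)
    return sorted(lost)


def watch(interval=0.5):
    """Poll the ARP table forever and print a line when an entry is lost."""
    with open(ARP_PATH) as f:
        previous = parse_arp_table(f.read())
    while True:
        time.sleep(interval)
        with open(ARP_PATH) as f:
            current = parse_arp_table(f.read())
        for ip in lost_entries(previous, current):
            print(f"{time.strftime('%H:%M:%S')} lost ARP entry for {ip}")
        previous = current
```

Calling watch() on the io node while the dd runs on the compute node would
show roughly when each entry drops out, which can then be lined up against
the transfer.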

I can run netperf between the compute node and io node and get full IPoIB
line rate with no issues.
I can run netperf between the io node and the Panasas storage and get full
10G ethernet line rate with no issues.

When looking at the TCP traces, I can clearly see that a big chunk of data
is sent between the end-points and then it stalls.  Immediately after the
stall there is an ARP request and then another chunk of data, and this
pattern repeats over and over.
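
The stall-then-ARP pattern described above could be picked out of a trace
with something like the sketch below.  It assumes the capture has already
been reduced to a time-sorted list of (timestamp in seconds, protocol)
pairs; the function name, the input format and the 0.2 second threshold are
all hypothetical choices for illustration.

```python
def find_stalls(packets, min_gap=0.2):
    """Find pauses in a packet trace.

    packets: list of (timestamp_seconds, protocol) tuples sorted by time.
    Returns a list of (gap_start, gap_length, protocol_after_gap) for every
    pause of at least min_gap seconds between consecutive packets.
    """
    stalls = []
    for (t_prev, _), (t_next, proto) in zip(packets, packets[1:]):
        gap = t_next - t_prev
        if gap >= min_gap:
            stalls.append((t_prev, gap, proto))
    return stalls


def stalls_followed_by_arp(packets, min_gap=0.2):
    """Keep only the pauses whose first packet afterwards is an ARP packet."""
    return [s for s in find_stalls(packets, min_gap) if s[2] == "arp"]
```

If nearly every pause is followed by an ARP packet, that would support the
idea that the transfer is waiting on address resolution rather than on the
storage or the link.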

Any thoughts or questions?

- Michael
