[Beowulf] IPoIB arp's disappearing

Michael Di Domenico mdidomenico4 at gmail.com
Thu Jul 10 03:36:14 PDT 2008


I'm having a bit of a weird problem that I cannot figure out.  If anyone from
the community can help, it would be appreciated.
Here's the packet flow:

cn(ib0)->io(ib0)->io(eth5)->pan(*)

cn = compute node
io = io node
pan = Panasas storage network

We have 12 shelves of Panasas network storage on a separate network, which
is fronted by bridge servers that route IPoIB traffic to 10G Ethernet.
We're using Mellanox ConnectX Ethernet/IB adapters everywhere.  We're
running OFED 1.3.1 and the latest firmware for IB/Eth everywhere.

Here's the problem.  I can mount the storage on the compute nodes, but if I
try to send anything more than 50MB of data via dd, I seem to lose the ARP
entries for the compute nodes on the io servers.  This seems to happen
whether I use the filesystem or a netperf run from the compute node to the
Panasas storage.
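
One way to confirm which entries go away is to snapshot the neighbor table on
the io node before and during a transfer and compare the two.  Below is a
minimal sketch of that comparison in Python, assuming the snapshots are text
in the /proc/net/arp layout.  The function names and all sample rows
(addresses, hardware addresses) are invented for illustration; real ib0
entries will look different.

```python
ATF_COM = 0x02  # flag bit in /proc/net/arp meaning the entry is resolved


def parse_arp_table(text):
    """Parse text in /proc/net/arp layout into {ip: (flags, hw_addr, device)}."""
    entries = {}
    for line in text.strip().splitlines()[1:]:  # first row is the header
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore anything that is not a full table row
        ip, _hw_type, flags, hw_addr, _mask, device = fields[:6]
        entries[ip] = (int(flags, 16), hw_addr, device)
    return entries


def lost_entries(before, after, device):
    """IPs that were resolved on `device` in `before` but are gone or unresolved in `after`."""
    lost = []
    for ip, (flags, _hw_addr, dev) in before.items():
        if dev != device or not flags & ATF_COM:
            continue
        current = after.get(ip)
        if current is None or not current[0] & ATF_COM:
            lost.append(ip)
    return sorted(lost)


# Invented sample snapshots; not output from the real cluster.
SAMPLE_BEFORE = """
IP address       HW type     Flags       HW address            Mask     Device
10.0.0.11        0x20        0x2         00:00:00:00:00:01     *        ib0
10.0.0.12        0x20        0x2         00:00:00:00:00:02     *        ib0
10.1.0.50        0x1         0x2         00:00:00:00:00:03     *        eth5
"""

SAMPLE_AFTER = """
IP address       HW type     Flags       HW address            Mask     Device
10.0.0.11        0x20        0x0         00:00:00:00:00:00     *        ib0
10.0.0.12        0x20        0x2         00:00:00:00:00:02     *        ib0
10.1.0.50        0x1         0x2         00:00:00:00:00:03     *        eth5
"""

if __name__ == "__main__":
    before = parse_arp_table(SAMPLE_BEFORE)
    after = parse_arp_table(SAMPLE_AFTER)
    print(lost_entries(before, after, "ib0"))
```

On the io node the two snapshots would come from reading /proc/net/arp once
before the dd run and again while it is stalled.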

I can run netperf between the compute node and io node and get full IPoIB
line rate with no issues.
I can run netperf between the io node and the Panasas storage and get full
10G Ethernet line rate with no issues.

When looking at the TCP traces, I can clearly see that a big chunk of data
is sent between the endpoints and then it stalls.  Immediately after the
stall there is an ARP request and then another chunk of data, and this
scenario repeats over and over.
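
The stall-then-ARP pattern can be checked mechanically rather than by eye.
This is a rough sketch in Python, assuming the trace has already been reduced
to (timestamp, kind) pairs sorted by time, where kind is "tcp" or "arp".
The function name, the 0.5 second threshold, and the sample timestamps are
all made up for illustration; they are not taken from the real capture.

```python
def find_stalls(packets, min_gap):
    """Return (gap_start, gap_end, next_kind) for every gap of at least min_gap seconds.

    packets: list of (timestamp_seconds, kind) tuples sorted by timestamp.
    next_kind is the kind of the first packet seen after the gap.
    """
    stalls = []
    for (t_prev, _kind_prev), (t_next, kind_next) in zip(packets, packets[1:]):
        if t_next - t_prev >= min_gap:
            stalls.append((t_prev, t_next, kind_next))
    return stalls


# Invented trace: bursts of TCP data separated by roughly one second stalls.
SAMPLE_TRACE = [
    (0.00, "tcp"), (0.01, "tcp"), (0.02, "tcp"),
    (1.02, "arp"), (1.03, "tcp"), (1.04, "tcp"),
    (2.04, "arp"), (2.05, "tcp"),
]

if __name__ == "__main__":
    for start, end, kind in find_stalls(SAMPLE_TRACE, min_gap=0.5):
        print("stall from %.2f to %.2f, ended by %s" % (start, end, kind))
```

If every stall in the real trace is ended by an ARP request, the length of
the gap is worth comparing against the neighbor timeout settings on the io
node.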

Any thoughts or questions?

Thanks
- Michael