[Beowulf] Help with inconsistent network performance

Tue Dec 18 21:40:48 PST 2007

On 12/18/07, Mark Hahn <hahn at mcmaster.ca > wrote:
>
> > The machines are running the 2.6 kernel and I have confirmed that the max
> > TCP send/recv buffer sizes are 4MB (more than enough to store the full
> > 512x512 image).
>
> the bandwidth-delay product in a lan is low enough to not need
> this kind of tuning.

I didn't actually do any tuning; I just checked that the maximum buffer size
the Linux auto-tuning can use is sufficient.
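
As a rough check of the bandwidth-delay point, the product can be worked out
directly. This is only a sketch: the 1 Gbit/s link speed is an assumption (the
thread does not state it), and the 0.3 ms figure is the small-message RTT
measured in the test described further down, used here as a generous upper
bound.

```python
def bandwidth_delay_product(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return bandwidth_bits_per_s * rtt_s / 8


# Assumed values: 1 Gbit/s link (not stated in the thread), 0.3 ms RTT.
bdp_bytes = bandwidth_delay_product(1e9, 0.3e-3)

# The 4MB maximum buffer size confirmed above.
max_buffer_bytes = 4 * 1024 * 1024

print(f"BDP is about {bdp_bytes / 1000:.1f} KB; "
      f"max buffer is {max_buffer_bytes / bdp_bytes:.0f}x larger")
```

Under those assumptions the product is on the order of tens of kilobytes, far
below the 4MB ceiling, which is consistent with the remark that no buffer
tuning should be needed on a LAN.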

> > I loop with the client side program sending a single integer to rank 0, then
> > rank 0 broadcasts this integer to the other nodes, and then all nodes send
> > back 1MB / N of data.
>
> hmm, that's a bit harsh, don't you think?  why not have the rank0/master
> ask each slave for its contribution sequentially?  sure, it introduces a bit
> of "dead air", but it's not as if two slaves can stream to a single master
> at once anyway (each can saturate its link, therefore the master's link is
> N-times overcommitted.)

I guess I figured that the data is relatively small compared to the
bandwidth, whereas the latency for Ethernet is relatively high.  I also
thought the switch would be able to efficiently buffer and forward the
data.  I am not much of a networking guy (more a graphics guy), so I
realize I could be way off base here.
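
To put a number on how much the switch would have to buffer, here is a sketch
using an idealized fluid model. The assumptions are loud ones: every node
starts sending at the same instant at full line rate, the receiving port
drains at that same line rate, and TCP congestion control is ignored. The
function name is hypothetical, and how much buffer the switch actually has per
port is not known from this thread.

```python
def peak_switch_backlog(total_bytes: float, num_senders: int) -> float:
    """Peak bytes queued at the receiver's switch port (idealized fluid model).

    Each sender transmits total_bytes / num_senders at line rate R, so data
    arrives at num_senders * R while the port drains at R. The senders finish
    after (total_bytes / num_senders) / R seconds, at which point the queue
    holds total_bytes * (num_senders - 1) / num_senders. R cancels out.
    """
    if num_senders < 1:
        raise ValueError("num_senders must be at least 1")
    return total_bytes * (num_senders - 1) / num_senders


one_megabyte = 1024 * 1024
backlog = peak_switch_backlog(one_megabyte, 5)
print(f"peak backlog with 5 senders: {backlog / 1024:.0f} KiB")
```

In this model, five simultaneous senders leave about four fifths of the 1MB
queued in the switch at the peak. If the port's buffer is smaller than that,
frames would be dropped unless flow control steps in.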

> > To make sure there was not an issue with the MPI broadcast, I did one test
> > run with 5 nodes only sending back 4 bytes of data each.  The result was an
> > RTT of less than 0.3 ms.
>
> isn't that kind of high?  a single ping-pong latency should be ~50 us -
> maybe I'm underestimating the latency of the broadcast itself.

This is quite a bit more than a single ping-pong. The viewer sends to the
master node (rank 0), then the master node broadcasts to all other nodes,
and then all nodes send back to the viewer node.  I don't know whether this
still seems high.
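
A back-of-the-envelope count of the sequential message hops in that cycle,
as a sketch only. It assumes the broadcast is a binomial tree taking
ceil(log2(N)) steps, which is a common MPI implementation choice but is not
confirmed for the MPI library used here. It also reads the ~50 us figure
quoted above as a per-message latency; if that figure is a full round trip,
halve the result.

```python
import math


def sequential_hops(num_ranks: int) -> int:
    """Count message hops on the critical path of one request/response cycle."""
    viewer_to_master = 1
    # Assumed binomial-tree broadcast: ceil(log2(N)) sequential steps.
    broadcast_steps = math.ceil(math.log2(num_ranks))
    # All nodes reply in parallel, so this adds one hop to the critical path.
    nodes_to_viewer = 1
    return viewer_to_master + broadcast_steps + nodes_to_viewer


hops = sequential_hops(5)
per_hop_us = 50  # assumed per-message latency, from the ~50 us figure above
print(f"{hops} sequential hops, roughly {hops * per_hop_us} us")
```

With 5 ranks this gives 5 sequential hops, so an estimate of a few hundred
microseconds is in the same range as the measured RTT of less than 0.3 ms.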

> > One interesting pattern I noticed is that the hiccup frame RTTs, almost
> > without exception, fall into one of three ranges (approximately 50-60,
> > 200-210, and 250-260). Could this be related to exponential back-off?
>
> perhaps introduced by the switch, or perhaps by the fact that the bcast
> isn't implemented as an atomic (eth-level) broadcast.
>

But the bcast is always just sending 4 bytes (a single integer) and, as
mentioned above, no hiccups occur until the size of the final gather packets
(from all nodes to the viewer) is increased.

>
> > Tomorrow I will experiment with jumbo frames and flow control settings (both
> > of which the HP Procurve claims to support).  If these do not solve the
> > problems I will start sifting through tcpdump.
>
> I would simply serialize the slaves' responses first.  the current design
> tries to trigger all the slaves to send results at once, which is simply
> not logical if you think about it, since any one slave can saturate
> the master's link.
>

I still have the feeling that the switch should be able to handle this more
efficiently, but since your idea is relatively simple to implement I will
give it a try and see what the performance is like.
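
A minimal sketch of what serializing the responses could look like. It uses
Python threads and queues as stand-ins for the nodes and the network, not the
actual MPI program; `worker` and `gather_serialized` are hypothetical names,
and the even 1MB / N split follows the description earlier in the thread.

```python
import queue
import threading


def worker(rank, chunk, requests, replies):
    """Hypothetical node: stay quiet until asked, then send this rank's chunk."""
    requests.get()              # block until the collector asks this rank
    replies.put((rank, chunk))  # only now put data on the "wire"


def gather_serialized(chunks):
    """Ask each worker for its contribution one at a time, in rank order."""
    replies = queue.Queue()
    request_queues = [queue.Queue() for _ in chunks]
    threads = [
        threading.Thread(target=worker,
                         args=(rank, chunk, request_queues[rank], replies))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()

    order, image = [], bytearray()
    for rank in range(len(chunks)):
        request_queues[rank].put("send")  # poll exactly one worker...
        sender, data = replies.get()      # ...and wait for it before the next
        order.append(sender)
        image += data

    for t in threads:
        t.join()
    return order, bytes(image)


n = 5
chunks = [bytes([rank]) * (1024 * 1024 // n) for rank in range(n)]
order, image = gather_serialized(chunks)
print(order, len(image))
```

Only one sender is active at any moment, so the receiving link is never
overcommitted. The price is one extra request message per node, which is the
"dead air" mentioned above.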

Thanks for your input.

>
> regards, mark hahn.
>