[Beowulf] Help with inconsistent network performance

Tue Dec 18 20:52:25 PST 2007

> The machines are running the 2.6 kernel and I have confirmed that the max
> TCP send/recv buffer sizes are 4MB (more than enough to store the full
> 512x512 image).

the bandwidth-delay product in a lan is low enough to not need 
this kind of tuning.
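
to put a rough number on that, here is a back-of-the-envelope sketch. the link speed and round-trip time are my assumptions (gigabit ethernet, 100 us RTT), not figures from the original post; only the 4MB buffer size comes from it.

```python
# Bandwidth-delay product: the number of bytes that must be "in flight"
# to keep a link busy for one full round trip.

def bdp_bytes(link_bits_per_s: float, rtt_s: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return link_bits_per_s * rtt_s / 8.0

LINK_BPS = 1e9                   # ASSUMED: gigabit ethernet
RTT_S = 100e-6                   # ASSUMED: 100 us LAN round-trip time
BUFFER_BYTES = 4 * 1024 * 1024   # the 4MB max buffer from the original post

bdp = bdp_bytes(LINK_BPS, RTT_S)
print(f"BDP: {bdp:.0f} bytes")
print(f"4MB buffer / BDP: {BUFFER_BYTES / bdp:.0f}x")
```

under those assumptions the BDP is on the order of 12 KB, so a 4MB buffer is hundreds of times larger than what is needed to keep the pipe full.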

> I loop with the client side program sending a single integer to rank 0, then
> rank 0 broadcasts this integer to the other nodes, and then all nodes send
> back 1MB / N of data.

hmm, that's a bit harsh, don't you think?  why not have the rank0/master
ask each slave for its contribution sequentially?  sure, it introduces a bit
of "dead air", but it's not as if two slaves can stream to a single master 
at once anyway (each can saturate its link, therefore the master's link is 
N-times overcommitted.)
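
a rough cost model of the "dead air", as a sketch only. the link speed, the per-request round trip and the slave count are hypothetical values I picked for illustration; the 1MB total is from the original post. the model ignores protocol overhead entirely.

```python
# Compare the best possible collection time (master link never idle)
# with a serialized scheme where the master asks one slave at a time.

def wire_time_s(nbytes: float, link_bps: float) -> float:
    """Time to push nbytes through a link at line rate, ignoring overhead."""
    return nbytes * 8.0 / link_bps

def serialized_collect_s(total_bytes: float, n_slaves: int,
                         link_bps: float, request_rtt_s: float) -> float:
    """Master requests each slave's share in turn; each request costs one
    round trip of idle time before that slave's data starts to arrive."""
    per_slave = total_bytes / n_slaves
    return n_slaves * (request_rtt_s + wire_time_s(per_slave, link_bps))

LINK_BPS = 1e9             # ASSUMED: gigabit ethernet
REQUEST_RTT_S = 100e-6     # ASSUMED: 100 us per request round trip
TOTAL_BYTES = 1024 * 1024  # the 1MB total from the original post
N_SLAVES = 5               # HYPOTHETICAL slave count

floor = wire_time_s(TOTAL_BYTES, LINK_BPS)
serial = serialized_collect_s(TOTAL_BYTES, N_SLAVES, LINK_BPS, REQUEST_RTT_S)

print(f"line-rate floor : {floor * 1e3:.2f} ms")
print(f"serialized      : {serial * 1e3:.2f} ms")
print(f"dead air        : {(serial - floor) * 1e3:.2f} ms")
```

with these made-up numbers the dead air is N round trips, about half a millisecond on top of a floor of roughly 8.4 ms. whether that is a win in practice depends on how badly the all-at-once version actually behaves at the switch.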

> To make sure there was not an issue with the MPI broadcast, I did one test
> run with 5 nodes only sending back 4 bytes of data each.  The result was a
> RTT of less than 0.3 ms.

isn't that kind of high?  a single ping-pong latency should be ~50 us - 
maybe I'm underestimating the latency of the broadcast itself.
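
a quick sanity check on that, as a sketch under loud assumptions: I am assuming the bcast is a binomial tree (a common choice, but I have not checked what this MPI does), that ~50 us is the one-way cost of each hop, and that the measured path is one request hop, the bcast, then one reply hop.

```python
# Estimate the latency of: client -> rank 0, tree broadcast, reply.

def binomial_bcast_rounds(n_ranks: int) -> int:
    """Sequential rounds a binomial-tree broadcast needs to reach n_ranks,
    i.e. ceil(log2(n_ranks)), computed with integer arithmetic."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return (n_ranks - 1).bit_length()

ONE_WAY_S = 50e-6   # ASSUMED: ~50 us per hop
N_RANKS = 5         # the 5-node test from the original post

rounds = binomial_bcast_rounds(N_RANKS)
hops = 1 + rounds + 1   # ASSUMED path: request hop, bcast rounds, reply hop
estimate_s = hops * ONE_WAY_S

print(f"bcast rounds: {rounds}")
print(f"estimated latency: {estimate_s * 1e6:.0f} us")
```

if those assumptions are anywhere near right, the broadcast rounds alone account for more than half of the total, so a figure just under 0.3 ms is not as far off as a single ping-pong would suggest.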

> One interesting pattern I noticed is that the hiccup frame RTTs, almost
> without exception, fall into one of three ranges (approximately 50-60,
> 200-210, and 250-260). Could this be related to exponential back-off?

perhaps introduced by the switch, or perhaps by the fact that the bcast
isn't implemented as an atomic (eth-level) broadcast.

> Tomorrow I will experiment with jumbo frames and flow control settings (both
> of which the HP Procurve claims to support).  If these do not solve the
> problems I will start sifting through tcpdump.

I would simply serialize the slaves' responses first.  the current design
tries to trigger all the slaves to send results at once, which is simply
not logical if you think about it, since any one slave can saturate
the master's link.

regards, mark hahn.