newbie: 16-node 500Mbps design

Josip Loncaric josip at icase.edu
Mon Aug 28 16:00:31 PDT 2000


Mark Hahn wrote:
> 
> no.  Josip's (fine) work is a specific tuning for small-packet performance;
> it violates the standards, or at least accepted practice for TCP.

Using retransmit timeouts shorter than 200ms may break TCP connections
to older BSD hosts.  However, a fixed 200ms floor on the retransmit
interval is ridiculous on a closed Beowulf network (200ms at 100Mbit/s
represents 2.5 MBytes).  BTW, TCP estimates the retransmit timeout
adaptively, so bounding that estimate from below with a 20ms floor
(which Linux can easily handle on Intels) is appropriate for Beowulf
use.
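
To make the arithmetic concrete, here is a minimal sketch (not taken from
the patch itself) that computes how much data a link could have carried
while a sender sits out a retransmit timeout, and shows how a floor bounds
an adaptive timeout estimate from below.  The function names and the 3ms
sample estimate are hypothetical, chosen only for illustration.

```python
def bytes_during(link_bits_per_s, interval_ms):
    """Bytes the link could carry during interval_ms (integer math)."""
    return link_bits_per_s * interval_ms // 8000  # /8 bits per byte, /1000 ms per s


def clamp_rto(estimated_ms, floor_ms):
    """Bound an adaptive timeout estimate from below by a fixed floor."""
    return max(estimated_ms, floor_ms)


FAST_ETHERNET = 100_000_000  # 100 Mbit/s

lost_at_200ms = bytes_during(FAST_ETHERNET, 200)  # 2,500,000 bytes (2.5 MBytes)
lost_at_20ms = bytes_during(FAST_ETHERNET, 20)    # 250,000 bytes

# Assumed example: on a quiet LAN the adaptive estimate may be only a few ms,
# so the floor, not the estimate, decides how long a stalled sender waits.
wait_with_200ms_floor = clamp_rto(3, 200)  # 200
wait_with_20ms_floor = clamp_rto(3, 20)    # 20
```

With these numbers, lowering the floor from 200ms to 20ms cuts the
capacity idled per timeout by a factor of ten.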

> that's fine for tweaking your cluster, but it does NOT show a general problem
> with stalls.  it's a little unclear to me why he calls these events
> "deadlocks", since afaict, they're simply retransmit timeouts in TCP
> terminology, part of TCP's congestion-avoidance heuristics.

TCP stalls happen often, but as long as there is a good reason (e.g.
congestion) I would not call them 'deadlocks'.  However, when both
sender and receiver have the capacity to transfer more data, but are
forced to wait for a timeout because of a deadlock in TCP logic, then
the term is appropriate.  One form of this deadlock is described in:

"How a large ATM MTU causes deadlocks in TCP data transfers," by Kjersti
Moldeklev and Per Gunningberg, IEEE/ACM Trans. on Networking, v3, No. 4,
Aug. 1995, pp. 409-422. (see
http://www2.comp.polyu.edu.hk/~comp555/INPSII/deadlock.pdf)

A common feature of this deadlock and the one my patch addresses is
that delayed ACKs can be mistaken for network congestion.  My simple
fix reduces the probability of deadlocks by sending immediate ACKs with
an adjustable probability (we use p=1/8).  In seeking a deadlock-free
TCP, others have proposed a more elaborate Adaptive Acknowledgment
Algorithm:

Adam Yeung and Rocky K. C. Chang, "Improving TCP Throughput Performance
on High-Speed Networks with a Receiver-Side Adaptive Acknowledgment
Algorithm."  (see
http://www2.comp.polyu.edu.hk/~comp555/INPSII/d2-5.pdf)
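
As a rough illustration of the simpler probabilistic scheme mentioned
above (immediate ACKs with probability p), the following sketch simulates
only the per-segment decision.  It is an assumption-laden toy, not the
kernel patch: the names should_ack_immediately and count_immediate_acks
are hypothetical, and a real receiver would make this choice inside its
delayed-ACK logic.

```python
import random


def should_ack_immediately(rng, p=1 / 8):
    """Return True with probability p, meaning: skip the delayed-ACK timer."""
    return rng.random() < p


def count_immediate_acks(n_segments, p=1 / 8, seed=0):
    """Count how many of n_segments would be ACKed immediately."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return sum(should_ack_immediately(rng, p) for _ in range(n_segments))


n = 80_000
fraction = count_immediate_acks(n) / n  # should be close to 1/8
```

With p=1/8, on average one segment in eight draws an immediate ACK, so a
sender that is waiting on a delayed ACK is unlikely to stay stuck for many
segments in a row, while most ACKs still benefit from delaying.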

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134



