[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Thu Dec 14 15:01:47 PST 2006

Donald Becker wrote:

>>
>>I'm not quite following here: It seems like you might be advocating
>>retransmits every half second. I'm currently doing classical exponential
>>backoff, 1 second delay, then two, then four etc. Will that bite me?
> 
> 
> Where are you doing exponential back-off?
Retransmits in the TFTP server: send a block and await the
corresponding ACK; if it doesn't arrive within the timeout, re-send.
This is needed to recover from lost data packets; client retries only
recover from lost ACKs (at least they do in implementations which have
been immunised against Sorcerer's Apprentice Syndrome).
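
A minimal sketch of that server-side loop, in Python for brevity. The
`send` and `wait_for_ack` callables are hypothetical stand-ins for the
real socket code, and the retry limit is an arbitrary assumption; only
the 1 second starting timeout and the doubling come from the
description above.

```python
def send_block_with_backoff(send, wait_for_ack, block_no, payload,
                            initial_timeout=1.0, max_retries=5):
    """Send one DATA block and wait for its ACK, doubling the timeout
    after every miss.

    send(block_no, payload) and wait_for_ack(block_no, timeout) are
    hypothetical callables; wait_for_ack returns True if the matching
    ACK arrived within `timeout` seconds.
    Returns the number of transmissions used.
    """
    timeout = initial_timeout
    for attempt in range(1, max_retries + 2):
        send(block_no, payload)
        if wait_for_ack(block_no, timeout):
            return attempt
        timeout *= 2  # classical exponential backoff: 1 s, 2 s, 4 s, ...
    raise TimeoutError(
        f"block {block_no}: no ACK after {max_retries + 1} transmissions")
```

Note that the block is re-sent only when the timer expires, never in
response to a duplicate ACK, which is the usual cure for Sorcerer's
Apprentice Syndrome.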

> The TFTP client will/should/might do a retry every second.  (Background:
> TFTP uses "ACK" of the previous packet to mean "send the next one".  The
> only way to detect that this is a retry is timing.) The client might do a
> re-ARP first.  In corner cases it might not reply to ARP itself.
> 
> [[ Step up on the soapbox. ]]
> 
> What idiot thought that exponential backoff was a good idea?
> Exponential backoff doesn't make sense where your base time period is a
> whole second and you can't tell if the reason for no response is
> failure, busy network or no one listening.
> 
> My guess is that they were just copying Ethernet, where modified,
> randomized exponential backoff is what makes it magically good.
> Exponential backoff makes sense at the microsecond level, where you have
> a collision domain and potentially 10,000 hosts on a shared ether.  Even
> there the idea of "carrier sense" or 'is the network busy' is what
> enables Ethernet to work at 98+% utilization rather than the 18% or 37%
> theoretical of Aloha Net.  (Key difference: deaf transmitter.)
> 
> What usually happens with DHCP and PXE is that the first packet is used
> getting the NIC to transmit correctly.  The second packet is used to get
> the switch to start passing traffic.  The third packet gets through but we
> are already well into the exponential backoff.
> 
> PXE would be much better and more reliable if it started out
> transmitting a burst of four DHCP packets evenly spaced in the first
> second, then falling back to once per second.  If there is a concern
> about DHCP being a high percentage of traffic in huge installations
> running 10baseT, tell them to buy a server. Or, like, you know, a
> router.  Because later the ARP traffic alone will dwarf a few DHCP
> broadcasts.

It's probably worth differentiating DHCP and TFTP here. I guess the
reason for exponential backoff is to avoid congestion collapse as the
ratio of useful work to bits-on-the-wire decreases. By the time a host
is doing TFTP the network path should be established, so bursting
packets shouldn't be needed. Maybe delaying backoff would make sense.
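
To make the three schedules under discussion concrete, here is a rough
Python sketch that computes when each transmission would go out. The
function names, the `hold` parameter, and the choice to leave a full
second after the last burst packet are assumptions made for
illustration, not anything PXE or a TFTP server actually implements.

```python
from itertools import accumulate

def send_times(delays):
    """Turn the gaps between transmissions into absolute send times."""
    return [0.0] + list(accumulate(delays))

def exponential_delays(retries, base=1.0):
    """Classical backoff: 1 s, 2 s, 4 s, ..."""
    return [base * 2 ** i for i in range(retries)]

def burst_then_steady_delays(retries, burst=4, steady=1.0):
    """`burst` packets evenly spaced in the first second, then one
    packet every `steady` seconds (one reading of the suggestion)."""
    gap = 1.0 / burst
    return [gap if i < burst - 1 else steady for i in range(retries)]

def delayed_backoff_delays(retries, hold=3, base=1.0):
    """Keep the base delay for the first `hold` gaps, then double."""
    return [base * 2 ** max(0, i - hold + 1) for i in range(retries)]
```

With these defaults, `send_times(exponential_delays(4))` gives
0, 1, 3, 7, 15 seconds, `send_times(burst_then_steady_delays(6))` gives
0, 0.25, 0.5, 0.75, 1.75, 2.75, 3.75, and
`send_times(delayed_backoff_delays(5))` gives 0, 1, 2, 3, 5, 9. The
third packet, the one that tends to get through, leaves at 3 seconds
under classical backoff and at 0.5 seconds with the burst.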
> 
> 
>>I'm doing round-robin, but I don't see how to throttle active
>>connections: do I need to do that, or just limit total bandwidth?
> 
> 
> Yes, you need to throttle active TFTP connections.  The clients
> currently winning can turn around a next-packet request really quickly.
> If a few get in lock step, the server will have the next chunk of the
> file warm in the cache.  This is the start of locking out the first
> loser.
> 
> You can't just let the ACKs queue up in the socket as a substitute for
> deferring responses either.  You have to pull them out ASAP and mark
> that client as needing a response.  This doesn't cost very much.   You
> need to keep the client state structure anyway.  This is just one more
> bit, plus updating the timeval that you should be keeping anyway.
> 
All true. I'll experiment with some throttling approaches.
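
One possible shape for this, as a hedged Python sketch: ACKs are
recorded the moment they are read and nothing is sent from that path; a
separate pass then serves a limited number of clients, longest-waiting
first. The class and method names and the per-pass `budget` are
hypothetical, and this is only one of several throttling policies that
would fit the description above.

```python
from dataclasses import dataclass

@dataclass
class ClientState:
    last_acked: int = 0           # highest block number the client has ACKed
    needs_response: bool = False  # the "one more bit": a DATA block is owed
    ack_time: float = 0.0         # when the pending ACK was read (the timeval)

class DeferredResponder:
    def __init__(self):
        self.clients = {}  # client address -> ClientState

    def on_ack(self, addr, block_no, now):
        """Record an ACK as soon as it is pulled off the socket.
        Sends nothing; it only marks the client as owed a block."""
        state = self.clients.setdefault(addr, ClientState())
        if block_no <= state.last_acked:
            return  # duplicate ACK: ignore, do not re-send
        state.last_acked = block_no
        state.needs_response = True
        state.ack_time = now

    def next_batch(self, budget):
        """Pick at most `budget` clients to serve, longest-waiting first.
        Returns (address, block number to send) pairs and clears the flag."""
        waiting = sorted((s.ack_time, addr)
                         for addr, s in self.clients.items()
                         if s.needs_response)
        batch = []
        for _, addr in waiting[:budget]:
            state = self.clients[addr]
            state.needs_response = False
            batch.append((addr, state.last_acked + 1))
        return batch
```

Serving by oldest pending ACK rather than by arrival order on the
socket means a client that turns its ACKs around quickly cannot jump
ahead of one that has been waiting, which is the lock-out case
described above.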

Cheers,

Simon.
