Bonded head nodes

Donald Becker becker at
Fri Nov 8 16:01:21 PST 2002

On Fri, 8 Nov 2002, Greg Lindahl wrote:

> > I recently wrote both a TFTP client and server.  I initially did some
> > research on what already existed, and was surprised at the high ratio
> > of talk to implementation of multicast TFTP.  "Interoperating with"
> > isn't the same as "currently implements".
> Any chance that you could write up a little document about what can be
> done to TFTP clients and servers to make them more reliable? A while ago
> I found out, the hard way, that the standard Linux server and some random
> embedded board's client did not want to work when the server was my laptop
> with a (slow) PCMCIA ethernet card. I've also seen that it's easy for
> a server to get congested enough that clients give up.

One of the motivations for doing the multi-stage kernel-TCP+Monte based
boot, rather than using TFTP, was that TFTP starts failing somewhere
around 20-30 nodes.  The primary problems are
  TFTP retransmission timeouts, which the base RFC (1350) leaves unspecified
  Switch overload and packet dropping
  Server packet overload, packet dropping, and bandwidth capture

A few years ago I had hoped TFTP would just go away, but with PXE we now
have to make it work.
Ways to address (read that as "almost solve") the TFTP problems are
  Implementing the "timeout" option to TFTP.
  Better switches, link flow control, and gigabit Ethernet
  Better OS queue layer and a carefully written TFTP server

> As for multicast TFTP, it's a weird beast -- you have to synchronize
> all your nodes (so much for "ripple booting" to minimize the power
> surge), and if you have packet loss somewhere, it's usually the case
> that everyone gets hurt. Neither of these is an optimal thing to do.

One problem with multicast is that no one uses it because it's
frequently partially broken, and the brokenness isn't important to fix
because no one uses it.
Another aspect is that multicast packets are the first type a switch
will drop, and higher-end switches usually handle multicast packets in
firmware rather than hardware.

An additional problem with TFTP and all multicast protocols is that they
must be handled by the server on a packet-by-packet basis.  You can't
use sendfile() or TCP transmit offload, and a cluster is one of the
situations where both of these work great.
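For comparison, here is roughly what the TCP path buys you: with sendfile(2) the kernel streams the file straight from the page cache to the socket, so the server never touches the data in user space, while a UDP TFTP server must read, packetize, and send every 512-byte block itself.  A Linux-specific sketch (the function name is my own; error handling is trimmed):

```c
/* Sketch: zero-copy file transmit with Linux sendfile(2).
 * out_fd would normally be a connected TCP socket; the kernel
 * advances 'offset' for us as pages go out.                    */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file on out_fd; return bytes sent, -1 on error. */
static ssize_t send_whole_file(int out_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             st.st_size - offset);
        if (n <= 0)
            break;              /* error or peer gone */
    }
    close(in_fd);
    return offset;
}
```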

For those wanting to do a little research, here are some comments from
our client and server source code:

 * This file implements functions to serve files using TFTP.
 * See RFC1350, RFC2090, RFC2347, RFC2348, RFC2349, etc. and the
 * Preboot Execution Environment (PXE) Specification v2.1.
 *  This server supports only octet mode and 512 byte transfer blocks.
 *  This TFTP client uses the new options defined in
 *  RFC2347, RFC2348, and RFC2349.
 *  TFTP options are defined in RFC1782
 *  TFTP blocksize option defined in RFC2348 (obsoletes 1783).
 *  See RFC2090 for the multicast info.
 * Compare to the basic implementation at
 * That implementation does not handle options and is easily confused
 * by unexpected packets.

Donald Becker				becker at
Scyld Computing Corporation
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993
