[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Thu Dec 14 13:07:14 PST 2006

On Thu, 14 Dec 2006, Simon Kelley wrote:
> Donald Becker wrote:
> > I should repeat this: forking a dozen processes sounds like a good idea.
> > Thinking about forking a thousand (we plan every element to scale to "at 
> > least 1000") makes "1" seem like a much better idea.
> > 
> > With one continuously running server, the coding task is harder.  You 
> > can't leak memory.  You can't leak file descriptors.  You have to check for 
> > updated/modified files.  You can't block on anything.  You have to re-read
> > your config file and re-open your control sockets on SIGHUP rather than 
> > just exiting.  You should show/checkpoint the current state on SIGUSR1.
> 
> All that stuff is there, and has been bedded down over several years.
> The TFTP code is an additional 500 lines.

It's not difficult to write a TFTP server. (The "trivial" in the name
is a hint for those that haven't tried it.)  It's difficult to write a
reliable scalable one.  But you have a head start.

> > Once you do have all of that written, it's now possible, even easy, to 
> > count how many bytes and packets were sent in the last timer tick and to 
> > check that every client asked for and received a packet in the last half
> > second.  Combine the two and you can smoothly switch from bandwidth 
> > control to round-robin responses, then to slightly deferring DHCP 
> > responses.
> 
> I'm not quite following here: It seems like you might be advocating
> retransmits every half second. I'm currently doing classical exponential
> backoff, 1 second delay, then two, then four etc. Will that bite me?

Where are you doing exponential back-off?  For the TFTP client?
The TFTP client will/should/might do a retry every second.  (Background:
TFTP uses "ACK" of the previous packet to mean "send the next one".  The
only way to detect that an ACK is a retry is timing.) The client might do a
re-ARP first.  In corner cases it might not reply to ARP itself.

[[ Step up on the soapbox. ]]

What idiot thought that exponential backoff was a good idea?
Exponential backoff doesn't make sense where your base time period is a
whole second and you can't tell if the reason for no response is
failure, busy network or no one listening.

My guess is that they were just copying Ethernet, where modified,
randomized exponential backoff is what makes it magically good.
Exponential backoff makes sense at the microsecond level, where you have
a collision domain and potentially 10,000 hosts on a shared ether.  Even
there the idea of "carrier sense" or 'is the network busy' is what
enables Ethernet to work at 98+% utilization rather than the 18% (pure)
or 37% (slotted) theoretical maximum of ALOHAnet.  (Key difference: deaf
transmitter.)

What usually happens with DHCP and PXE is that the first packet is used
getting the NIC to transmit correctly.  The second packet is used to get
the switch to start passing traffic.  The third packet gets through, but we
are already well into the exponential backoff.

PXE would be much better and more reliable if it started out
transmitting a burst of four DHCP packets evenly spaced in the first
second, then fell back to once per second.  If there is a concern
about DHCP being a high percentage of traffic in huge installations
running 10baseT, tell them to buy a server. Or, like, you know, a
router.  Because later the ARP traffic alone will dwarf a few DHCP
broadcasts.
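
To put numbers on the timing argument, here is a minimal sketch that
compares the two schedules.  The function names are made up for
illustration, and it assumes the scenario described above: the first
two packets are lost no matter when they are sent.  It is not taken
from any PXE implementation.

```python
def exponential_backoff_times(count, base=1.0):
    """Send times in seconds when the wait doubles after every attempt."""
    times, now, delay = [], 0.0, base
    for _ in range(count):
        times.append(now)
        now += delay
        delay *= 2
    return times


def burst_then_steady_times(count, burst=4, period=1.0):
    """Send `burst` evenly spaced packets in the first period, then one per period."""
    times = []
    for i in range(count):
        if i < burst:
            times.append(i * period / burst)
        else:
            times.append((i - burst + 1) * period)
    return times


def first_delivery(times, lost=2):
    """Send time of the first packet that gets through if the first `lost` are dropped."""
    return times[lost]


if __name__ == "__main__":
    print("backoff:", first_delivery(exponential_backoff_times(5)))
    print("burst:  ", first_delivery(burst_then_steady_times(6)))
```

Under that assumption the third packet goes out at 3 seconds with
doubling delays, against half a second with the burst.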

> I'm doing round-robin, but I don't see how to throttle active
> connections: do I need to do that, or just limit total bandwidth?

Yes, you need to throttle active TFTP connections.  The clients
currently winning can turn around a next-packet request really quickly.
If a few get in lock step, the server will have the next chunk of the
file warm in the cache.  This is the start of locking out the first
loser.

You can't just let the ACKs queue up in the socket as a substitute for
deferring responses either.  You have to pull them out ASAP and mark
that client as needing a response.  This doesn't cost very much.  You
need to keep the client state structure anyway.  This is just one more
bit, plus updating the timeval that you should be keeping anyway.
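
A rough sketch of that bookkeeping, under loud assumptions: the class
and field names are hypothetical, the list of (address, block) pairs
stands in for a non-blocking recvfrom() loop, and "ACK of block 0" is
used as shorthand for the initial request.  This is not dnsmasq's or
Scyld's code, only one way the idea could look.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    next_block: int = 1            # block the client is waiting for
    needs_response: bool = False   # the "one more bit"
    last_ack_time: float = 0.0     # stand-in for the timeval


class TftpScheduler:
    def __init__(self):
        self.clients = {}   # address -> ClientState, in arrival order
        self.cursor = 0     # where the next round-robin pass starts

    def drain_acks(self, acks, now):
        """Pull every queued ACK right away and mark the client; send nothing yet."""
        for addr, acked_block in acks:
            state = self.clients.setdefault(addr, ClientState())
            state.next_block = acked_block + 1   # a repeated ACK re-requests the same block
            state.needs_response = True
            state.last_ack_time = now

    def serve_round(self, budget):
        """Answer at most `budget` pending clients, resuming after the last one served."""
        addrs = list(self.clients)
        sent = []
        start = self.cursor
        for offset in range(len(addrs)):
            if len(sent) >= budget:
                break
            index = (start + offset) % len(addrs)
            state = self.clients[addrs[index]]
            if state.needs_response:
                state.needs_response = False
                sent.append((addrs[index], state.next_block))
                self.cursor = (index + 1) % len(addrs)
        return sent
```

Because the cursor moves past each client as it is served, a client
that turns its ACK around instantly still waits behind every other
pending client before it gets its next block.  `budget` is where a
bandwidth limit per timer tick would plug in.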

> >> It's maybe worth giving a bit of background here: dnsmasq is a
> >> lightweight DNS forwarder and DHCP server. Think of it as being
> >> equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode
> >> but doing dynamic DNS and a bit of authoritative DNS too.
> >
> > One of the things we have been lacking in Scyld has been an external DNS
> > service for compute nodes.  For cluster-internal name look-ups we
> > developed BeoNSS.
> Dnsmasq is worth a look.

We likely can't leverage anything there.  We already have a name
system in BeoNSS.  We just need the gateway from this NSS to DNS
queries.

> > BeoNSS uses the linear address assignment of compute nodes to
> > calculate the name or IP address e.g. "Node23" is the IP address
> > of Node0 + 23.  So BeoNSS depends on the assignment policy of
> > the PXE server (1).
> To do that with dnsmasq you'll have to nail down the IP address
> associated with every MAC address.
..
> standard.) OTOH if you use dnsmasq to provide your name service you
> might not need the linear assignment.

I consider naming and numbering an important detail.

The freedom to assign arbitrary names and IP addresses is a useful
flexibility in a workstation environment.  But for a compute room or
cluster you want regular names and automatic-but-persistent IP
addresses.
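
The linear scheme quoted above ("Node23" is the IP address of Node0 +
23) needs no lookup table at all.  A minimal sketch, assuming a
made-up Node0 address and the "NodeN" spelling from the example; the
function names are illustrative and this is not the BeoNSS source:

```python
import ipaddress

# Hypothetical address for Node0; a real cluster would take it from its config.
NODE0_ADDRESS = ipaddress.IPv4Address("10.54.0.100")


def node_to_ip(name, base=NODE0_ADDRESS):
    """Map 'Node23' to the address of Node0 + 23."""
    prefix = "node"
    suffix = name[len(prefix):]
    if not name.lower().startswith(prefix) or not suffix.isdecimal():
        raise ValueError(f"not a node name: {name!r}")
    return base + int(suffix)


def ip_to_node(address, base=NODE0_ADDRESS):
    """Map an address back to its node name; the inverse of node_to_ip()."""
    number = int(ipaddress.IPv4Address(address)) - int(base)
    if number < 0:
        raise ValueError(f"{address} is below Node0")
    return f"Node{number}"
```

With the assumed base address, node_to_ip("Node23") gives 10.54.0.123
and ip_to_node("10.54.0.123") gives "Node23".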

We assign compute nodes a small integer node number the first time we
accept them into the cluster.  This is the node's persistent ID unless
the administrator manually changes it.

We used to allow node specialization based on MAC address as well as
node number.  The idea was the MAC address identified the specific
machine hardware (e.g. extra disks or a frame buffer, server #6 of 16 in
a PVFS array), while the node number might be used to specialize for a
logical purpose.

What we quickly found was that mostly-permanent node number assignment
was a useful simplification.  We deprecated using MAC specialization in
favor of the node number being used for both physical and logical
specialization. 

Just like you don't want your home address to change when a house down
the street burns down, you don't want node IP addresses or node
numbering to change.  But you want automatic numbering when the street
is extended or a new house is built on a vacant lot, with a manual
override saying this house replaces the one that burnt down.
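
In code, the house-numbering policy might look something like the
sketch below.  The registry class, its method names and the choice to
key on a single MAC address are all assumptions made for
illustration, not a description of the Scyld implementation.

```python
class NodeRegistry:
    def __init__(self):
        self.number_by_mac = {}   # MAC address -> persistent node number

    def accept(self, mac):
        """Return the node number for `mac`, assigning the lowest free one the first time."""
        if mac in self.number_by_mac:
            return self.number_by_mac[mac]   # existing houses keep their number
        used = set(self.number_by_mac.values())
        number = 0
        while number in used:
            number += 1
        self.number_by_mac[mac] = number
        return number

    def replace(self, old_mac, new_mac):
        """Manual override: the new hardware takes over the old machine's number."""
        number = self.number_by_mac.pop(old_mac)
        self.number_by_mac[new_mac] = number
        return number
```

A dead node is never removed automatically, so its number stays
reserved until the administrator calls replace(); everything else is
numbered without intervention.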

[[ Do I get extra points for not using an automotive analogy?  I can
throw them away with "You don't care about the cylinder numbering in
your car.  But it's useful to have them numbered when you replace the
spark plug cables." ]]

> > (1) This leads to one of the many details that you have to get right.
> > The PXE server always assigns a temporary IP address to new nodes.  Once
> > a node has booted and passed tests, we then assign it a permanent node
> > number and IP address.  Assigning short-lease IP addresses then changing
> > a few seconds later requires tight, race-free integration with the DHCP
> > server and ARP tables.  That's easy with a unified server, difficult
> > with a script around ISC DHCP.
>
> Is this a manifestation of the with-and-without-client-id problem? PXE
> sends a client-id, but the OS doesn't, or vice-versa. Dnsmasq has nailed
> down rules which work in most cases of this, mainly by trial-and-error.

No, it's a different issue.

PXE does have UUIDs, a universally unique ID that is distinct from MAC
addresses.  If you implement from the spec, you can use the UUID to pass
out IP addresses and avoid the messiness of using the MAC address.

I know I have the first machine built with the feature.  It has the UUID
with all zeros :-O.  Then I have a whole bunch of other machines that
must have been built for other universes because they have exactly the
same all-zeros ID.

Even when the UUID is distinct, it doesn't uniquely ID the machine.
Different NICs on the same machine have different UUIDs, meaning you
cannot detect that it's the same machine you got a request from a few
seconds ago.

Bottom line: UUIDs are wildly useless.

We address the multi-NIC case, along with a few others, by only
assigning a persistent node number after the machine boots and runs a
test program.  The test program is elegantly simple: a Linux-based DHCP
client.  The request packets have an option field of all MAC addresses.
(BTW, this is the same DHCP client code originally written to do PXE
scalability tests.)
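
Once a request carries every MAC address on the machine, spotting a
machine that is already known on a different NIC is a set
intersection.  A small sketch; the data layout and function names are
hypothetical, and how the MAC list is encoded in the DHCP option is
deliberately left out:

```python
def normalize(macs):
    """Lower-case the MAC strings so comparisons ignore case."""
    return {mac.lower() for mac in macs}


def find_node(nodes, reported_macs):
    """Return the number of the node sharing any MAC with the request, else None.

    `nodes` maps node number -> collection of MAC addresses seen for that node.
    """
    reported = normalize(reported_macs)
    for number, macs in nodes.items():
        if normalize(macs) & reported:
            return number
    return None
```

A request from a second NIC then resolves to the same node number as
the first, as long as the two reports overlap in at least one address.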

-- 
Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993