[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Thu Dec 14 08:04:14 PST 2006

Donald Becker wrote:
>> Is that server open-source/free software, or part of Scyld's product? No
>> judgement implied, I'm just interested to know if I can download and
>> learn from it.
> 
> When I wrote the first implementation I expected that we would be 
> publishing it under the GPL or a similar open source license, as we had 
> with most of our previous software.
> But the problems we had with Los Alamos removing the Scyld name and
> copyright from our code (the Scyld PXE server uses our "beoconfig" 
> config file interface, which is common to both BProc and BeoBoot) caused 
> us to not publish the code initially.  And as often happens, early 
> decisions stick around far longer than you expect.
> 
> At some point we may revisit that decision, but it's not currently
> a priority.  I have been very willing to talk with people about the 
> implementation, although only people such as Peter Anvin (pxelinux) and 
> Marty Connor (Etherboot) don't quickly find a reason to "freshen their 
> drink" when I start ;->.

My glass is full; let us continue!

> I should repeat this: forking a dozen processes sounds like a good idea.
> Thinking about forking a thousand (we plan every element to scale to "at 
> least 1000") makes "1" seem like a much better idea.
> 
> With one continuously running server, the coding task is harder.  You 
> can't leak memory.  You can't leak file descriptors.  You have to check for 
> updated/modified files.  You can't block on anything.  You have to re-read
> your config file and re-open your control sockets on SIGHUP rather than 
> just exiting.  You should show/checkpoint the current state on SIGUSR1.

All that stuff is there, and has been bedded down over several years.
The TFTP code is an additional 500 lines.
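
As an illustration of the SIGHUP/SIGUSR1 requirements quoted above, here is a
minimal Python sketch of the usual pattern: the signal handlers only set flags,
and the single main loop re-reads the config or dumps state between socket
polls. It is not taken from dnsmasq or the Scyld server. The ServerState class,
the "key value" config format and the packets_sent counter are all
hypothetical.

```python
import json
import signal


class ServerState:
    """Hypothetical holder for config and runtime counters."""

    def __init__(self, config_path):
        self.config_path = config_path
        self.config = {}
        self.packets_sent = 0
        self.reload_requested = True   # forces a config load on the first pass
        self.dump_requested = False

    def load_config(self):
        # Assumed format: one "key value" pair per line, '#' starts a comment.
        config = {}
        with open(self.config_path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if line:
                    key, _, value = line.partition(" ")
                    config[key] = value.strip()
        self.config = config
        self.reload_requested = False

    def dump(self):
        # Checkpoint the current state as a JSON string.
        self.dump_requested = False
        return json.dumps(
            {"config": self.config, "packets_sent": self.packets_sent},
            sort_keys=True,
        )


def install_handlers(state):
    # Handlers only set flags, so no file I/O happens in signal context.
    def on_hup(signum, frame):
        state.reload_requested = True

    def on_usr1(signum, frame):
        state.dump_requested = True

    signal.signal(signal.SIGHUP, on_hup)
    signal.signal(signal.SIGUSR1, on_usr1)


def service_flags(state):
    """Call once per main-loop iteration; returns a state dump or None."""
    if state.reload_requested:
        state.load_config()
    if state.dump_requested:
        return state.dump()
    return None
```

A real server would call service_flags() after each select()/poll() wakeup and
write the dump to a log or a file rather than returning it.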
> 
> Once you do have all of that written, it's now possible, even easy, to 
> count how many bytes and packets were sent in the last timer tick and to 
> check that every client asked for and received a packet in the last half
> second.  Combine the two and you can smoothly switch from bandwidth 
> control to round-robin responses, then to slightly deferring DHCP 
> responses.

I'm not quite following here: It seems like you might be advocating
retransmits every half second. I'm currently doing classical exponential
backoff, 1 second delay, then two, then four etc. Will that bite me?
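
For reference, the two retransmit policies being compared can be written down
as a small Python sketch. The function names, the retry count and the optional
cap are assumptions for illustration. The half-second figure comes from the
quoted text, and whether it really means a fixed retransmit interval is exactly
the open question.

```python
def backoff_delays(initial=1.0, factor=2.0, max_delay=None, retries=5):
    """Yield successive retransmit delays in seconds.

    factor=2.0 gives classical exponential backoff (1, 2, 4, ...);
    factor=1.0 gives a fixed interval.
    """
    delay = initial
    for _ in range(retries):
        yield delay if max_delay is None else min(delay, max_delay)
        delay *= factor


def retransmit_times(start, **kwargs):
    """Absolute times at which retransmits would fire, counted from start."""
    t = start
    times = []
    for d in backoff_delays(**kwargs):
        t += d
        times.append(t)
    return times
```

With the defaults, the third retransmit fires 7 seconds after the original
packet. With initial=0.5 and factor=1.0 it fires after 1.5 seconds.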

I'm doing round-robin, but I don't see how to throttle active
connections: do I need to do that, or just limit total bandwidth?
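
One way to picture "limit total bandwidth" combined with round-robin is a
per-tick byte budget shared by all active transfers. This is a rough Python
sketch under stated assumptions, not a description of dnsmasq or the Scyld
server. The RoundRobinSender class, the 512-byte block size and the tick model
are hypothetical, and the sketch ignores TFTP's lock-step ACKs (a real server
could not send a client its next block until the previous one was
acknowledged).

```python
from collections import deque


class RoundRobinSender:
    """Serve one block per client in turn until the tick's budget runs out."""

    def __init__(self, bytes_per_tick, block_size=512):
        self.bytes_per_tick = bytes_per_tick
        self.block_size = block_size
        self.queue = deque()   # clients with blocks still to send, in turn order
        self.remaining = {}    # client -> number of blocks left

    def add_client(self, client, blocks):
        if blocks > 0:
            self.remaining[client] = blocks
            self.queue.append(client)

    def tick(self):
        """Return the clients sent a block during this timer tick, in order."""
        budget = self.bytes_per_tick
        served = []
        while self.queue and budget >= self.block_size:
            client = self.queue.popleft()
            served.append(client)
            budget -= self.block_size
            self.remaining[client] -= 1
            if self.remaining[client] > 0:
                self.queue.append(client)   # back of the line for its next block
            else:
                del self.remaining[client]  # transfer complete
        return served
```

Here only the total is capped. No individual connection is throttled, but
because clients take turns, a single fast client cannot use the whole budget
while others are waiting.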
>  
>> It's maybe worth giving a bit of background here: dnsmasq is a
>> lightweight DNS forwarder and DHCP server. Think of it as being
>> equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode
>> but doing dynamic DNS and a bit of authoritative DNS too.
> 
> One of the things we have been lacking in Scyld has been an external DNS 
> service for compute nodes.  For cluster-internal name lookups we 
> developed BeoNSS.
Dnsmasq is worth a look.
> 
> BeoNSS uses the linear address assignment of compute nodes to
> calculate the name or IP address e.g. "Node23" is the IP address 
> of Node0 + 23.  So BeoNSS depends on the assignment policy of 
> the PXE server (1).
To do that with dnsmasq you'll have to nail down the IP address
associated with every MAC address. DHCP IP address assignment to
anonymous hosts is pseudo-random. (actually, it's done using a hash of
the MAC address. That allows repeated DHCPDISCOVERs to be offered the
same IP address without needing any server-side state until a lease is
actually allocated. Some DHCP clients depend on getting the same answer
to repeated DISCOVERs, without any support whatsoever from the
standard.) OTOH if you use dnsmasq to provide your name service you
might not need the linear assignment.
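
The two assignment schemes discussed here can be sketched in a few lines of
Python. The first pair of functions is the linear mapping described for BeoNSS
(node N is Node0's address plus N). The third shows the general idea of a
stateless, hash-based offer. The function names are made up, and SHA-256 is
only a stand-in: this is not the hash dnsmasq actually uses, and collision
handling is left out.

```python
import hashlib
import ipaddress


def node_address(node0_ip, n):
    """Linear assignment: address of node n is Node0's address plus n."""
    return ipaddress.IPv4Address(node0_ip) + n


def node_number(node0_ip, ip):
    """Inverse mapping: recover the node number from an address."""
    return int(ipaddress.IPv4Address(ip)) - int(ipaddress.IPv4Address(node0_ip))


def hashed_offer(mac, pool_start, pool_size):
    """Pick a pool address from the MAC alone, with no server-side state.

    The same MAC always maps to the same address, so repeated
    DHCPDISCOVERs get the same offer. SHA-256 is an illustrative choice.
    """
    digest = hashlib.sha256(mac.lower().encode("ascii")).digest()
    offset = int.from_bytes(digest[:4], "big") % pool_size
    return ipaddress.IPv4Address(pool_start) + offset
```

For example, node_address("10.0.0.10", 23) is 10.0.0.33, and the arithmetic
carries across octet boundaries, so node 23 counted from 10.0.0.250 is
10.0.1.17.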
> 
> BeoNSS works great, especially when establishing all-to-all communication.  
> But we failed to consider that external file and license servers might not 
> be running Linux, and therefore couldn't use BeoNSS.  We now see that 
> we need DNS and NIS (2) gateways for BeoNSS names.
> 
> (1) This leads to one of the many details that you have to get right.
> The PXE server always assigns a temporary IP address to new nodes.  Once
> a node has booted and passed tests, we then assign it a permanent node 
> number and IP address.  Assigning short-lease IP addresses then changing a 
> few seconds later requires tight, race-free integration with the DHCP 
> server and ARP tables.  That's easy with a unified server, difficult with 
> a script around ISC DHCP.

Is this a manifestation of the with-and-without-client-id problem? PXE
sends a client-id, but the OS doesn't, or vice-versa. Dnsmasq has nailed
down rules which work in most cases of this, mainly by trial-and-error.
> 

> You probably won't want to go the whole way with the implementation, but 
> hopefully I've given some useful suggestions.

Agreed, and thanks for the pointers.

Cheers,

Simon.