[Beowulf] SATA II - PXE+NFS - diskless compute nodes
becker at scyld.com
Wed Dec 13 15:00:53 PST 2006
On Wed, 13 Dec 2006, Simon Kelley wrote:
> Donald Becker wrote:
> > On Tue, 12 Dec 2006, Simon Kelley wrote:
> >> Joe Landman wrote:
> >>>>> I would hazard that any DHCP/PXE type install server would struggle
> >>>>> with 2000 requests (yes- you arrange the power switching and/or
> >>>>> reboots to stagger at N second intervals).
> > The limit with the "traditional" approach, the ISC DHCP server with one of
> > the three common TFTP servers, is about 40 machines before you risk losing
> > machines during a boot. With 100 machines you are likely to lose 2-5
> > during a typical power-restore cycle when all machines boot
> > simultaneously.
> > The right solution is to build a smart, integrated PXE server that
> > understands the bugs and characteristics of PXE. I wrote one a few years
> Is that server open-source/free software, or part of Scyld's product? No
> judgement implied, I'm just interested to know if I can download and
> learn from it.
When I wrote the first implementation I expected that we would be
publishing it under the GPL or a similar open source license, as we had
with most of our previous software.
But the problems we had with Los Alamos removing the Scyld name and
copyright from our code (the Scyld PXE server uses our "beoconfig"
config file interface, which is common to both BProc and BeoBoot) caused
us to not publish the code initially. And as often happens, early
decisions stick around far longer than you expect.
At some point we may revisit that decision, but it's not currently
a priority. I have been very willing to talk with people about the
implementation, although only people such as Peter Anvin (pxelinux) and
Marty Conner (Etherboot) don't quickly find a reason to "freshen their
drink" when I start ;->.
> >>> fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a
> >>> marvelous job of both, and is far easier to configure (e.g. it is less
> >>> fussy) than dhcpd.
The configuration files issue was one of the triggering reasons for
investigating writing our own server.
Until 2002 we were focused on BeoBoot as the solution for booting nodes,
and PXE was a side thought to support a handful of special machines, such
as the RLX blades.
As PXE became common we went down the path of using our config file
to generate ISC DHCP config files. This broke one of my rules:
avoid using config files to write other config files. You can't trace
updates to their effects, and can't trace problems to their source. This
was a test that proved the rule: we had three independent
ways to write the config files to have backups if/when we encountered a
bug. But that meant three programs were broken each time the ISC DHCP
config file changed incompatibly.
> >> Joe, you might like to know that the next release of dnsmasq includes a
> >> TFTP server so that it can do the whole job. The process model for the
> >> TFTP implementation should be well suited to booting many nodes at once
> >> because it multiplexes all the connections on the same process. My guess
> >> is that will work better than having inetd fork 2000 copies of tftpd,
> >> which is what would happen with traditional TFTP servers.
> > Yup, that's a good start. It's one of the many things you have to do.
I should repeat this: forking a dozen processes sounds like a good idea.
Thinking about forking a thousand (we plan every element to scale to "at
least 1000") makes "1" seem like a much better idea.
With one continuously running server, the coding task is harder. You
can't leak memory. You can't leak file descriptors. You have to check for
updated/modified files. You can't block on anything. You have to re-read
your config file and re-open your control sockets on SIGHUP rather than
just exiting. You should show/checkpoint the current state on SIGUSR1.
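
A minimal sketch of the SIGHUP/SIGUSR1 part of that checklist, in Python for brevity. The class name `BootServer`, the `load_config` callable and the report fields are hypothetical and not taken from the Scyld server or dnsmasq. The handlers only set flags; the main loop does the actual re-read and state dump, so nothing blocks inside a signal handler.

```python
import signal


class BootServer:
    """Hypothetical skeleton of a long-running, single-process boot server."""

    def __init__(self, load_config):
        self.load_config = load_config    # callable returning a config dict
        self.config = load_config()
        self.reload_requested = False
        self.dump_requested = False
        self.packets_served = 0

    def install_signal_handlers(self):
        # Unix only: SIGHUP asks for a config re-read, SIGUSR1 for a state dump.
        signal.signal(signal.SIGHUP, self._on_sighup)
        signal.signal(signal.SIGUSR1, self._on_sigusr1)

    def _on_sighup(self, signum, frame):
        self.reload_requested = True      # defer the work to the main loop

    def _on_sigusr1(self, signum, frame):
        self.dump_requested = True        # defer the work to the main loop

    def poll_once(self):
        """One pass of the main loop: act on any pending signal requests."""
        report = None
        if self.reload_requested:
            self.reload_requested = False
            self.config = self.load_config()   # re-read instead of exiting
        if self.dump_requested:
            self.dump_requested = False
            report = {"packets_served": self.packets_served,
                      "config": dict(self.config)}
        return report
```

A real server would call `poll_once()` from the same non-blocking loop that services its sockets, and would also re-open its control sockets on reload.
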
Once you do have all of that written, it's now possible, even easy, to
count how many bytes and packets were sent in the last timer tick and to
check that every client asked for and received a packet in the last half
second. Combine the two and you can smoothly switch from bandwidth
control to round-robin responses, then to slightly deferring DHCP
> It's maybe worth giving a bit of background here: dnsmasq is a
> lightweight DNS forwarder and DHCP server. Think of it as being
> equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode
> but doing dynamic DNS and a bit of authoritative DNS too.
One of the things we have been lacking in Scyld has been an external DNS
service for compute nodes. For cluster-internal name lookups,
BeoNSS uses the linear address assignment of compute nodes to
calculate the name or IP address e.g. "Node23" is the IP address
of Node0 + 23. So BeoNSS depends on the assignment policy of
the PXE server (1).
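
The linear mapping can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not BeoNSS code: the function names and the Node0 address `10.54.0.100` are assumptions chosen for the example.

```python
import ipaddress
import re

# Assumed example address for Node0; a real cluster would read this from config.
NODE0_ADDRESS = ipaddress.IPv4Address("10.54.0.100")


def node_to_ip(name, node0=NODE0_ADDRESS):
    """Compute a node's IP address from its name: address(NodeN) = address(Node0) + N."""
    match = re.fullmatch(r"[Nn]ode(\d+)", name)
    if match is None:
        raise ValueError(f"not a node name: {name!r}")
    return node0 + int(match.group(1))


def ip_to_node(address, node0=NODE0_ADDRESS):
    """Reverse lookup: compute the node name from an IP address."""
    offset = int(ipaddress.IPv4Address(address)) - int(node0)
    if offset < 0:
        raise ValueError(f"{address} is below the Node0 address")
    return f"Node{offset}"
```

With these assumptions, `node_to_ip("Node23")` gives `10.54.0.123`. No table or network lookup is involved, which is why the scheme only works if the PXE server hands out addresses in this linear order.
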
BeoNSS works great, especially when establishing all-to-all communication.
But we failed to consider that external file and license servers might not
be running Linux, and therefore couldn't use BeoNSS. We now see that
we need DNS and NIS (2) gateways for BeoNSS names.
(1) This leads to one of the many details that you have to get right.
The PXE server always assigns a temporary IP address to new nodes. Once
a node has booted and passed tests, we then assign it a permanent node
number and IP address. Assigning short-lease IP addresses and then changing
them a few seconds later requires tight, race-free integration with the DHCP
server and ARP tables. That's easy with a unified server, difficult with
a script around ISC DHCP.
(2) We need NIS or NIS+ for netgroups. Netgroups are used to export file
systems to the cluster, independent of base IP address and size changes.
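
The two-stage assignment in note (1) could look roughly like the Python sketch below when the DHCP logic and the node table live in one process. Everything here is hypothetical: the class name, the 30-second temporary lease, and the example addresses are assumptions, and ARP handling is left out entirely.

```python
import ipaddress


class NodeAddressTable:
    """Hypothetical in-memory table: temporary address first, permanent later."""

    def __init__(self, node0, temp_pool_start, temp_pool_size, temp_lease_secs=30):
        self.node0 = ipaddress.IPv4Address(node0)
        start = ipaddress.IPv4Address(temp_pool_start)
        self.temp_free = [start + i for i in range(temp_pool_size)]
        self.temp_lease_secs = temp_lease_secs   # assumed short lease
        self.temporary = {}   # MAC -> temporary address
        self.permanent = {}   # MAC -> node number

    def offer(self, mac):
        """Return (address, lease_seconds) for a DHCP request from this MAC."""
        if mac in self.permanent:
            # None stands for "use the normal long lease" in this sketch.
            return self.node0 + self.permanent[mac], None
        if mac not in self.temporary:
            self.temporary[mac] = self.temp_free.pop(0)
        return self.temporary[mac], self.temp_lease_secs

    def promote(self, mac, node_number):
        """After the node passes its tests, switch it to its permanent address."""
        released = self.temporary.pop(mac)
        self.temp_free.append(released)          # temporary address is reusable
        self.permanent[mac] = node_number
        return self.node0 + node_number
```

Because `promote()` updates both tables in one step inside the server, the next renewal from that MAC sees the permanent address immediately. A script wrapped around a separate DHCP daemon would have to edit its config and restart or signal it, which is where the races come from.
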
> Almost coincidentally, it's turned out to be useful for clusters too. I
> know from the dnsmasq mailing list that Joe Landman has used it in that
> way for a long time, and RLX used it in their control-tower product
> which has now been re-incarnated in HP's blade-management system.
I didn't know where it was used. It does explain some of the Control
> receives a UDP packet, computes a reply as a function of the input
> packet, the in-memory lease database and the current configuration, and
> synchronously sends the reply. The only time it even needs to allocate
> memory is when a new lease is created: everything else manages with a
> single packet buffer and a few statically-allocated data structures.
> This makes for great scalability.
You might consider breaking the synchronous reply aspect. It's convenient
because you can build the reply into the same packet buffer as the inbound
request. But it makes it difficult to defer responses. (With DHCP you
can take the sleazy approach of "only respond when the elapsed-time is
greater than X", at the risk of encountering PXE clients with short
> style, so I hope it will scale well too. I've already covered some of
> Don's checklist, and I'll pay attention to the rest of it, within the
> constraint that this has to be small and simple, to fit the primary, SOHO
> router, niche.
You probably won't want to go the whole way with the implementation, but
hopefully I've given some useful suggestions.
> > As part of writing the server I wrote DHCP and TFTP clients to simulate
> > high node count boots. But the harshest test was old RLX systems: each of
> > the 24 blades had three NICs, but could only boot off of the NIC
> > connected to the internal 100base repeater/hub. Plus the blade BIOS had a
> > good selection of PXE bugs.
> By chance, I have a couple of shelves of those available for testing.
> Would that be enough (48 blades, I guess) to get meaningful results?
Yes. Better, try running the server on one of the blades, serving the
other 47. Have the blade do some disk I/O at the same time. Transmeta
CPUs were not the fastest chips around, even in their prime.
[[ Hmmm, did this posting come up to RGB standards of length+detail? ]]
Donald Becker becker at scyld.com
Scyld Software Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220 www.scyld.com
Annapolis MD 21403 410-990-9993