[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Wed Dec 13 03:40:39 PST 2006

Donald Becker wrote:
> On Tue, 12 Dec 2006, Simon Kelley wrote:
> 
>> Joe Landman wrote:
>>>>> I would hazard that any DHCP/PXE type install server would struggle
>>>>> with 2000 requests (yes- you arrange the power switching and/or
>>>>> reboots to stagger at N second intervals).
> 
> Those that have talked to me about this topic know that it's a hot-button 
> for me.
> 
> The limit with the "traditional" approach, the ISC DHCP server with one of 
> the three common TFTP servers, is about 40 machines before you risk losing 
> machines during a boot.  With 100 machines you are likely to lose 2-5 
> during a typical power-restore cycle when all machines boot 
> simultaneously.
> 
> The actual node count limit is strongly dependent on the exact hardware 
> (e.g. the characteristics of the Ethernet switch) and the size of the boot 
> image (larger is much worse than you would expect).
> 
> Staggering node power-up is a hack to work around the limit.  You can
> build a lot of complexity into doing it "right", but still be rolling the 
> dice overall.  It's better to build a reliable boot system than to build
> a complex system around known unreliability.
> 
> The right solution is to build a smart, integrated PXE server that 
> understands the bugs and characteristics of PXE.  I wrote one a few years 
> ago and understand many of the problems.  It's clear to me that no matter
> how you hack up the ISC DHCP server, you won't end up with a good PXE 
> server.  (Read that carefully: yes, it's a great DHCP server; no, it's not
> good for PXE.)

Is that server open-source/free software, or part of Scyld's product? No
judgement implied, I'm just interested to know if I can download and
learn from it.

>  
>>> fwiw:  we use dnsmasq to serve dhcp and handle pxe booting.  It does a
>>> marvelous job of both, and is far easier to configure (e.g. it is less
>>> fussy) than dhcpd.
>> Joe, you might like to know that the next release of dnsmasq includes a
>> TFTP server so that it can do the whole job. The process model for the
>> TFTP implementation should be well suited to booting many nodes at once
>> because it multiplexes all the connections on the same process. My guess
>> is that will work better than having inetd fork 2000 copies of tftpd,
>> which is what would happen with traditional TFTP servers.
> 
> Yup, that's a good start.  It's one of the many things you have to do.  
> You are already far ahead of the "standard" approach.  Don't forget flow 
> and bandwidth control, ARP table stuffing and clean-up, state reporting, 
> etc.  Oh, and you'll find out about the PXE bug that results in a 
> zero-length filename.. expect it.

It's maybe worth giving a bit of background here: dnsmasq is a
lightweight DNS forwarder and DHCP server. Think of it as being
equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode
but doing dynamic DNS and a bit of authoritative DNS too. It's really
aimed at small networks which need a DNS server and a DHCP server where
the names of DHCP-configured hosts appear in the DNS but all other DNS
queries get passed to upstream recursive DNS servers (typically at an ISP).

Dnsmasq is widely used in the *WRT distributions which run in Linksys
WRT-54G-class SOHO routers, and similar "turn your old 486 into a home
router" products. It provides all the DNS and DHCP that these need in a
~100K binary that's flexible and easy to configure.

Almost coincidentally, it's turned out to be useful for clusters too. I
know from the dnsmasq mailing list that Joe Landman has used it in that
way for a long time, and RLX used it in their control-tower product
which has now been re-incarnated in HP's blade-management system. As Don
Becker points out in another message, ISC's dhcpd is way too heavyweight
for this sort of stuff. The dnsmasq DHCP implementation pretty much
receives a UDP packet, computes a reply as a function of the input
packet, the in-memory lease database and the current configuration, and
synchronously sends the reply. The only time it even needs to allocate
memory is when a new lease is created: everything else manages with a
single packet buffer and a few statically-allocated data structures.
This makes for great scalability.
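
To make that request/reply shape concrete, here is a minimal Python
sketch. It is not dnsmasq's code (dnsmasq is written in C), and it only
fills in the fixed BOOTP header, leaving out the magic cookie and the
DHCP options a real reply needs. The names compute_reply and next_free
and the layout of the config dictionary are assumptions made up for
illustration.

```python
import ipaddress
import struct

# Fixed BOOTP header: op, htype, hlen, hops, xid, secs, flags,
# ciaddr, yiaddr, siaddr, giaddr, chaddr, sname, file (236 bytes).
BOOTP_HEADER = struct.Struct("!BBBBIHH4s4s4s4s16s64s128s")
BOOTREQUEST, BOOTREPLY = 1, 2


def next_free(leases, config):
    """Return the first pool address with no lease, or None if the pool is full."""
    in_use = set(leases.values())
    for offset in range(config["pool_size"]):
        candidate = config["pool_start"] + offset
        if candidate not in in_use:
            return candidate
    return None


def compute_reply(packet, leases, config):
    """Compute a reply header from the packet, the lease table and the config.

    Returns None when the packet should be dropped. The lease table
    (hypothetical: a dict of MAC bytes to IPv4Address) only grows when a
    client is seen for the first time.
    """
    if len(packet) < BOOTP_HEADER.size:
        return None
    (op, htype, hlen, _hops, xid, _secs, flags, _ciaddr, _yiaddr,
     _siaddr, giaddr, chaddr, _sname, _file) = BOOTP_HEADER.unpack_from(packet)
    if op != BOOTREQUEST or hlen > 16:
        return None
    mac = chaddr[:hlen]
    addr = leases.get(mac)
    if addr is None:
        addr = next_free(leases, config)
        if addr is None:
            return None  # pool exhausted: stay silent
        leases[mac] = addr
    return BOOTP_HEADER.pack(
        BOOTREPLY, htype, hlen, 0, xid, 0, flags,
        b"\0" * 4, addr.packed, config["server"].packed, giaddr,
        chaddr, b"", config["boot_file"])
```

A caller would sit in a loop doing recvfrom(), compute_reply() and
sendto() on a single UDP socket, with no per-client process or thread.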

For the TFTP implementation I've stayed with the same implementation
style, so I hope it will scale well too. I've already covered some of
Don's checklist, and I'll pay attention to the rest of it, within the
constraint that this has to be small and simple, to fit the primary, SOHO
router, niche.
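
As a rough illustration of multiplexing every transfer in one process,
the Python sketch below keeps per-client state in a dictionary and
answers each packet with a plain function call. It is only a sketch
under stated assumptions, not the dnsmasq implementation: packet
layouts follow RFC 1350, the names handle_packet, transfers and files
are hypothetical, the boot images are assumed to be held in memory, and
timeouts, retransmission and block-number wraparound are left out. It
also answers an empty filename (the PXE bug Don mentions) with an error
packet instead of failing.

```python
import struct

BLOCK = 512                       # TFTP data payload size (RFC 1350)
RRQ, DATA, ACK, ERROR = 1, 3, 4, 5
FILE_NOT_FOUND = 1                # TFTP error code


def error(code, message):
    """Build a TFTP ERROR packet."""
    return struct.pack("!HH", ERROR, code) + message.encode("ascii") + b"\0"


def data_block(image, block):
    """Build DATA packet number `block` (1-based) from the in-memory image."""
    start = (block - 1) * BLOCK
    return struct.pack("!HH", DATA, block) + image[start:start + BLOCK]


def handle_packet(packet, client, transfers, files):
    """Return the reply for one packet, or None if nothing should be sent.

    `transfers` maps a client address to (image, last block sent), so any
    number of boots can be in flight without one process per client.
    """
    if len(packet) < 4:
        return None
    (opcode,) = struct.unpack_from("!H", packet)
    if opcode == RRQ:
        filename = packet[2:].split(b"\0", 1)[0].decode("ascii", "replace")
        if not filename:
            return error(FILE_NOT_FOUND, "empty filename")
        image = files.get(filename)
        if image is None:
            return error(FILE_NOT_FOUND, "file not found")
        transfers[client] = (image, 1)
        return data_block(image, 1)
    if opcode == ACK and client in transfers:
        image, block = transfers[client]
        (acked,) = struct.unpack_from("!H", packet, 2)
        if acked != block:
            return None           # stale or duplicate ACK: ignore it
        if len(image) - (block - 1) * BLOCK < BLOCK:
            del transfers[client]  # the short final block was acknowledged
            return None
        transfers[client] = (image, block + 1)
        return data_block(image, block + 1)
    return None
```

A real server would call handle_packet() from a select() or poll() loop
over its UDP sockets and add timers for lost packets; the point here is
only that each transfer costs one dictionary entry rather than one
forked process.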

> 
>> For ultimate scalability, I guess the solution is to use multicast-TFTP.
>> I know that support for that is included in the PXE spec, but I've never
>> tried to implement it. Based on prior experience of PXE ROMs, the chance
>> of finding a sufficiently bug-free implementation of mtftp there must be
>> fairly low.
> 
> This is a good example of why PXE is not just DHCP+TFTP.  The multicast
> TFTP in PXE is not multicast TFTP.  The DHCP response specifies the 
> multicast group to join, rather than negotiating it as per RFC2090.  That 
> means multicast requires communication between the DHCP and TFTP sections.
>  
>>> Likely with dhcpd, not sure how many dnsmasq can handle, but we have
>>> done 36 at a time to do system checking.  No problems with it.
> 
> As part of writing the server I wrote DHCP and TFTP clients to simulate 
> high node count boots.  But the harshest test was old RLX systems: each of 
> the 24 blades had three NICs, but could only boot off of the NIC 
> connected to the internal 100base repeater/hub.  Plus the blade BIOS had a 
> good selection of PXE bugs.

By chance, I have a couple of shelves of those available for testing.
Would that be enough (48 blades, I guess) to get meaningful results?

Cheers,

Simon.