[Beowulf] SATA II - PXE+NFS - diskless compute nodes
becker at scyld.com
Tue Dec 12 15:49:46 PST 2006
On Tue, 12 Dec 2006, Simon Kelley wrote:
> Joe Landman wrote:
> >>> I would hazard that any DHCP/PXE type install server would struggle
> >>> with 2000 requests (yes- you arrange the power switching and/or
> >>> reboots to stagger at N second intervals).
Those that have talked to me about this topic know that it's a hot-button
issue for me.
The limit with the "traditional" approach, the ISC DHCP server with one of
the three common TFTP servers, is about 40 machines before you risk losing
machines during a boot. With 100 machines you are likely to lose 2-5
during a typical power-restore cycle, when all machines boot at once.
The actual node count limit is strongly dependent on the exact hardware
(e.g. the characteristics of the Ethernet switch) and the size of the boot
image (larger is much worse than you would expect).
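To make the image-size effect concrete, here is a back-of-envelope sketch (my own illustrative numbers, not measurements from this thread): classic TFTP is lock-step, one 512-byte block per round trip, and every lost packet costs a full retransmit timeout, so the loss penalty grows with the number of blocks.

```python
# Illustrative arithmetic only -- RTT, timeout, and loss rates are assumptions.
BLOCK = 512            # default TFTP block size, bytes
RTT = 0.001            # assumed LAN round-trip time, seconds
TIMEOUT = 3.0          # assumed PXE ROM retransmit timeout, seconds

def transfer_time(image_bytes, loss_rate):
    """Rough expected TFTP transfer time for one client."""
    blocks = -(-image_bytes // BLOCK)      # ceiling division
    lost = blocks * loss_rate              # expected retransmissions
    return blocks * RTT + lost * TIMEOUT

# A 4 MB loader+kernel image: light loss vs. loss under boot-storm congestion.
print(round(transfer_time(4 * 2**20, 0.001), 1))   # -> 32.8
print(round(transfer_time(4 * 2**20, 0.01), 1))    # -> 254.0
```

The point of the sketch: once the switch starts dropping packets, time lost to timeouts dwarfs the transfer itself, and it scales with image size.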
Staggering node power-up is a hack to work around the limit. You can
build a lot of complexity into doing it "right", but still be rolling the
dice overall. It's better to build a reliable boot system than to build
a complex system around known unreliability.
The right solution is to build a smart, integrated PXE server that
understands the bugs and characteristics of PXE. I wrote one a few years
ago and understand many of the problems. It's clear to me that no matter
how you hack up the ISC DHCP server, you won't end up with a good PXE
server. (Read that carefully: yes, it's a great DHCP server; no, it's not
good for PXE.)
> > fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a
> > marvelous job of both, and is far easier to configure (e.g. it is less
> > fussy) than dhcpd.
> Joe, you might like to know that the next release of dnsmasq includes a
> TFTP server so that it can do the whole job. The process model for the
> TFTP implementation should be well suited to booting many nodes at once
> because it multiplexes all the connections on the same process. My guess
> is that it will work better than having inetd fork 2000 copies of tftpd,
> which is what would happen with traditional TFTP servers.
Yup, that's a good start. It's one of the many things you have to do.
You are already far ahead of the "standard" approach. Don't forget flow
and bandwidth control, ARP table stuffing and clean-up, state reporting,
etc. Oh, and you'll find out about the PXE bug that results in a
zero-length filename... expect it.
> For ultimate scalability, I guess the solution is to use multicast-TFTP.
> I know that support for that is included in the PXE spec, but I've never
> tried to implement it. Based on prior experience of PXE ROMs, the chance
> of finding a sufficiently bug-free implementation of mtftp there must be
> fairly low.
This is a good example of why PXE is not just DHCP+TFTP. The multicast
TFTP in PXE is not multicast TFTP. The DHCP response specifies the
multicast group to join, rather than negotiating it as per RFC2090. That
means multicast requires communication between the DHCP and TFTP sections.
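To make that coupling concrete, here is a rough sketch of how the assigned multicast group travels inside DHCP vendor option 43 (sub-option tags per my reading of the PXE 2.1 spec; the group address and port here are made-up examples). The client just decodes what the server assigned; nothing is negotiated RFC 2090-style.

```python
# Sketch: PXE MTFTP parameters arrive as tag/length/value sub-options
# inside DHCP option 43.  Tag numbers follow the PXE 2.1 spec as I
# recall it; the concrete group/port values below are illustrative.
import socket, struct

MTFTP_IP, MTFTP_CPORT, MTFTP_SPORT = 1, 2, 3

def parse_pxe_suboptions(opt43):
    """Parse tag/length/value sub-options; 0xFF terminates the list."""
    out, i = {}, 0
    while i < len(opt43) and opt43[i] != 0xFF:
        tag, length = opt43[i], opt43[i + 1]
        out[tag] = opt43[i + 2:i + 2 + length]
        i += 2 + length
    return out

# Example: a server assigning group 239.255.0.1 and client port 1758.
opt43 = (bytes([MTFTP_IP, 4]) + socket.inet_aton("239.255.0.1")
         + bytes([MTFTP_CPORT, 2]) + struct.pack("!H", 1758) + b"\xff")
subs = parse_pxe_suboptions(opt43)
print(socket.inet_ntoa(subs[MTFTP_IP]))   # -> 239.255.0.1
```

Since the DHCP side hands out the group and the TFTP side must already be streaming to it, the two halves of the server cannot be independent programs glued together.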
> > Likely with dhcpd, not sure how many dnsmasq can handle, but we have
> > done 36 at a time to do system checking. No problems with it.
As part of writing the server I wrote DHCP and TFTP clients to simulate
high node count boots. But the harshest test was old RLX systems: each of
the 24 blades had three NICs, but could only boot off of the NIC
connected to the internal 100base repeater/hub. Plus the blade BIOS had a
good selection of PXE bugs.
Another good test is booting Itaniums (really DHCP+TFTP, not PXE). They
have a 7MB kernel, and a similarly large initial ramdisk. Forget to
strip off the kernel symbols and you are looking at 70MB over TFTP. (But
they extend the block index from 16 to 64 bits, allowing you to start a
transfer that will take until the heat death of the universe to finish!
Really, 32 bits is sometimes more than enough. Especially when extending
a crude protocol that should have been forgotten long ago.)
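The arithmetic behind the complaint is worth spelling out (standard RFC 1350 numbers, assuming the default 512-byte block): a 70MB unstripped kernel needs more blocks than a 16-bit block counter can even address.

```python
# Why a 70 MB image breaks plain TFTP's 16-bit block numbering.
BLOCK = 512                    # RFC 1350 default block size
image = 70 * 2**20             # 70 MB unstripped kernel image
blocks = -(-image // BLOCK)    # ceiling division
print(blocks)                  # -> 143360
print(blocks > 2**16 - 1)      # -> True: wraps a 16-bit counter
```

So without either a larger block size or a wider block index, the counter wraps long before the transfer finishes, which is what motivated the extension being mocked above.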
> Dnsmasq will handle DHCP for thousands of clients on reasonably meaty
> hardware. The only rate-limiting step is a 2-3 second timeout while
> newly-allocated addresses are "ping"ed to check that they are not in
> use. That check is optional, and skipped automatically under heavy load,
> so a large number of clients is no problem.
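For reference, a minimal dnsmasq configuration along the lines Simon describes might look like the following (illustrative address range and paths; `no-ping` disables the address check he mentions, and `enable-tftp`/`tftp-root` assume a dnsmasq build with the new built-in TFTP server):

```conf
# Illustrative dnsmasq PXE-boot fragment -- values are examples only.
dhcp-range=192.168.0.100,192.168.0.200,12h
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/var/lib/tftpboot
no-ping
```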
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Donald Becker becker at scyld.com
Scyld Software Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220 www.scyld.com
Annapolis MD 21403 410-990-9993