[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Thu Dec 14 14:33:33 PST 2006

On Sat, 9 Dec 2006, Joe Landman wrote:
> Guy Coates wrote:
> > At what  node count does the nfs-root model start to break down?  Does anyone
> > have any rough numbers with the number of clients you can support with a generic
> > linux NFS server vs a dedicated NAS filer?
> 
> If you use warewulf or the new perceus variant, it creates a ram disk
> which is populated upon boot.  Thats one of the larger transients.  Then
> you nfs mount applications, and home directories.  I haven't looked at
> Scyld for a while, but I seem to remember them doing something like this.

I forgot to finish my reply to this message earlier this week.  Since I'm 
in the writing mood today, I've finished it.

Just when we were getting past "diskless" being misinterpreted as
"NFS root"...

Scyld does use "ramdisks" in our systems, but calling it "ramdisk based"
misses the point of the system.

Booting: RAMdisks are critical

Ramdisks are a key element of the boot system.  Clusters need reliable,
stateless node booting.  We don't want local misconfiguration, failed
storage hardware or corrupted file systems to prevent booting.

The boot ramdisks have to be small, simple, and reliable.  Large ramdisks
multiply PXE problems and have obvious server scalability issues.
Complexity is bad because the actions are happening "blind", with no easy
way to see which step went wrong.  We try to keep these images stable,
with only ID table and driver updates.

Run-time: RAMdisks are just an implementation detail

The run-time system uses ramdisks almost incidentally.  The real point
of our system is creating a single point of administration and control
-- a single virtual system.  To that end we have a dynamic caching,
consistent execution model.  The real root "hypervisor" operates out of
a ramdisk to be independent of the hardware and storage that might be
used by application environments.  The application root and caching
system default to using ramdisks, but they can be configured to use
local or network storage.

The "real root" ramdisk is pretty small and simple.  It's never seen by
the applications, and only needs to keep its own housekeeping info.  The
largest ramdisk in the system is the "libcache" FS.  This cache starts
out empty.  As part of the node accepting new applications, the
execution system (BProc or BeoProc) verifies that the correct versions
of the executable and libraries are available locally.  By the time the
node says "yah, I'll accept that job" it has cached the exact version it
needs to run.  (*)
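
To make the "cache the exact version before accepting the job" idea
concrete, here is a minimal sketch in Python.  It is not how BProc or
BeoProc implement it.  The LibCache class, the use of a SHA-256 digest
to identify a version, and copying from a plain file path are all
assumptions made for illustration.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


class LibCache:
    """Hypothetical whole-file cache keyed by content digest."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def ensure(self, master_path, wanted_digest):
        """Fetch the whole file unless that exact version is already cached."""
        local = os.path.join(self.cache_dir, wanted_digest)
        if not os.path.exists(local):
            tmp = local + ".tmp"
            shutil.copyfile(master_path, tmp)
            if file_digest(tmp) != wanted_digest:
                # The master's copy is not the version the job asked for.
                os.remove(tmp)
                raise RuntimeError("version mismatch for " + master_path)
            os.replace(tmp, local)
        return local

    def accept_job(self, needed):
        """needed: list of (master_path, digest) pairs.

        Returns the local paths only if every file is cached; any
        failure raises, so the node never accepts a job it cannot run.
        """
        return [self.ensure(path, digest) for path, digest in needed]
```

Because the cached copy is a whole file named by its digest, a later
change to the file on the master does not disturb a job that is already
running from the cache.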

So really we are not using a "ramdisk install".  We are dynamically
detecting hardware, and loading the right kernel and device drivers
under control of the boot system.  Then we are creating a minimal custom
"distribution" on the compute nodes.

The effect is the same as creating a minimal custom "distribution" for
that specific machine -- an installation that has only the kernel,
device drivers and applications to be run on that node.
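
As a rough illustration of table-driven hardware detection, the sketch
below maps PCI vendor/device IDs to driver module names and returns only
the drivers a given node needs.  The table contents and the function
name are illustrative assumptions, not Scyld's actual ID tables or code.

```python
# Illustrative table only: (vendor_id, device_id) -> Linux driver module.
NIC_ID_TABLE = {
    (0x8086, 0x100E): "e1000",    # Intel 82540EM gigabit Ethernet
    (0x10EC, 0x8139): "8139too",  # Realtek RTL8139 fast Ethernet
}


def drivers_for(detected_devices, id_table=NIC_ID_TABLE):
    """Return the driver modules needed for the detected PCI devices.

    Devices missing from the table are skipped.  Each driver is listed
    once, in order of first appearance.
    """
    needed = []
    for ids in detected_devices:
        driver = id_table.get(ids)
        if driver is not None and driver not in needed:
            needed.append(driver)
    return needed
```

With a table like this, supporting new hardware is a matter of updating
the table and adding the driver, which fits the "only ID table and
driver updates" goal for the boot images.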

This approach to dynamically building an installation is feasible and
efficient because of another innovation: a sharp distinction between
full, standard "master" nodes and lightweight compute "slave" nodes.
Only master nodes run the full, several-minute initialization to start
standard services and daemons.  ("How many copies of crond do you
need?")  Compute slaves exist only to run the end applications, and
have a master with its full reference install to fall back on when they
need to extend their limited environment.

(*) Whole file caching is one element of the reliability model.  It means
we can continue to run even if the master stops responding, or replaces
a file with a newer version.  We provide a way for sophisticated sites
to replace the file cache with a network file system, but then the file
server must be up to continue running and you can run into
versioning/consistency issues.

RAMdisk Inventory

We actually have five (!) different types of ramdisks across the system
(see the descriptions below).  But it's the opposite of the Warewulf
approach.  Our architecture is a consistent system model, so we
dynamically build and update the environment on nodes.  Warewulf-like
ramdisk systems only catch part of what we are doing:

The Warewulf approach
  - Uses a manually selected subset distribution on the compute node
    ramdisk.  While still very large, it's never quite complete.  No
    matter how useless you think some utility is, there is probably some
    application out there that depends on it.
  - The ramdisk image is very large and it has to be completely
    downloaded at boot time, just when the server is extremely busy.
  - Supplements the ramdisk with NFS, combining the problems of
    both.(*1)  The administrator and users have to learn and think about
    how both fail.

(*1) That said, combining a ramdisk root with NFS is still far more
scalable and somewhat more robust than using solely NFS.  With careful
administration most of the executables will be on the ramdisk, allowing
the server to support more nodes and reducing the likelihood of
failures.

The phrase "careful administration" should be read as "great for demos,
and when the system is first configured, but degrades over time".  The
type of people that leap to configure the ramdisk properly the first
time are generally not the same type that will be there for long-term
manual tuning.  Either they figure out why we designed around dynamic,
consistent caching and re-write, or the system will degrade over time.

Ramdisk types

For completeness, here are the five ramdisk types in Scyld:
   BeoBoot stage 1: (The "Booster Stage")
     Used only for non-PXE booting.
     Now obsolete, this allowed network booting on machines that didn't
     have it built in.  The kernel+ramdisk was small enough to fit on
     floppy, CD-ROM, hard disk, Disk-on-chip, USB, etc.
     This ramdisk image contains NIC detection code and tables,
     along with every NIC driver and a method to substitute kernels.
     This image must be under 1.44MB, yet include all NIC drivers.

   BeoBoot stage 2 ramdisk:
     The run-time environment set-up, usually downloaded by PXE.
     Pretty much the same NIC detection code as the stage 1 ramdisk,
     except potentially optimized for only the NICs known to be
     installed.  The purpose of this ramdisk is to start network logging
     ASAP and then contact the master to download the "real" run-time
     environment.
     When we have the new environment we pivotroot and delete this whole
     ramdisk.  We've used the contents we cared about (tables & NIC
     drivers), and just emptying ramdisks frequently leaks memory!
     It's critical that this ramdisk be small to minimize TFTP traffic.

   Stage 3, Run-time environment supervisor
     (You can call this the "hypervisor".)
     This is the "real" root during operation, although applications
     never see it.
     The size isn't critical because we have full TCP from stage 2 to
     transfer it, but it shouldn't be huge because
      - it will compete with other, less robust booting traffic
      - the master will usually be busy
      - large images will delay node initialization

   LibCache ramdisk:
     This is a special-purpose file system used only for caching
     executables and libraries.  We designed the system with a separate
     caching FS to optionally switch to caching on a local hard disk
     partition.  That was useful with 32MB memory machines or when doing
     a rapid large-boot demo, but the added complexity is rarely useful on
     modern systems.

   Environment root:
     This is the file system the application sees.  There is a different
     environment for each master the node supports, or potentially even
     one for each application started.
     By default this is a ramdisk configured as a minimal Unix root by
     the master.  The local administrator can change this to be a local
     or network file system to have a traditional "full install"
     environment, although that discards some of the robustness advances
     in Scyld.
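
The stage 1 size constraint in the list above is simple arithmetic: a
"1.44MB" floppy holds 1440 KiB, or 1,474,560 bytes.  The sketch below
checks a set of components against that budget.  The component names and
sizes in the usage note are made up for illustration, and the check
ignores boot sector and file system overhead.

```python
FLOPPY_BYTES = 1440 * 1024  # capacity of a "1.44MB" floppy: 1,474,560 bytes


def fits_on_floppy(component_sizes, limit=FLOPPY_BYTES):
    """Check a dict of component name -> size in bytes against the limit.

    Returns (fits, bytes_left); bytes_left is negative when over budget.
    """
    total = sum(component_sizes.values())
    return total <= limit, limit - total
```

For example, a hypothetical 900,000 byte kernel plus a 500,000 byte
ramdisk fits with 74,560 bytes to spare, while growing the kernel to
1,000,000 bytes puts the image 25,440 bytes over.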

>  Scyld requires a meatier head node as I remember due to its launch
>  model.

Not really because of the launch model, or the run-time control.  It's to
make the system less complex and simpler to use.

Ideally the master does less work than the compute nodes because they
are doing the computations.  In real life people use the master for
editing, compiling, scheduling, etc.  It's the obvious place to put home
directories and serve them to compute nodes. And it's where the
real-life cruft ends up, such as license servers and reporting tools.

Internally each type of service has its own server IP address and port.
We could point them to replicated masters or other file servers.  They
just all point to the single master to keep things simple.  For
reliability we can have cold, warm or hot spare masters.  But again,
it's less complex to administer one machine with redundant power
supplies and hot-swap RAID5 arrays.  All this makes the master node look
like the big guy.
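
The per-service addressing described above can be pictured as a small
table.  This is only a sketch: the service names, the address and the
port numbers are invented, and the real configuration format is not
shown in this message.

```python
MASTER = "10.0.0.1"  # hypothetical master address

# Each service has its own ordered list of (host, port) endpoints.
# By default every list holds only the single master.
SERVICES = {
    "boot":     [(MASTER, 9001)],
    "libcache": [(MASTER, 9002)],
    "logging":  [(MASTER, 9003)],
}


def pick_endpoint(service, is_up, services=SERVICES):
    """Return the first endpoint for which is_up(endpoint) is true.

    is_up is a caller-supplied health check.  Returns None when no
    endpoint for the service is reachable.
    """
    for endpoint in services[service]:
        if is_up(endpoint):
            return endpoint
    return None
```

Appending a spare master to one service's list is then a one-line
change, which is the flexibility being traded away for simplicity when
everything points at a single machine.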

-- 
Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993