[Beowulf] Compute Node OS on Local Disk vs. Ram Disk

Tue Sep 30 11:53:22 PDT 2008

On Sun, 28 Sep 2008, Jon Forrest wrote:

> There are two philosophies on where a compute node's
> OS and basic utilities should be located:
> 1) On a local harddrive
> 2) On a RAM disk
> I'd like to start a discussion on the positives
> and negatives of each approach. I'll throw out
> a few.
> 
> Both approaches require that a compute node "distribution"
> be maintained on the frontend machine. In both cases
> it's important to remember to make any changes to this
> distribution rather than just using "pdsh" or "tentakel"
> to dynamically modify a compute node. This is so that the
> next time the compute node boots, it gets the up-to-date
> distribution.

Ahhh, your first flawed assumption.

You believe that the OS needs to be statically provisioned to the nodes.
That is incorrect.

A compute node only needs what it will actually be running:
  - a kernel and device drivers that match the hardware
  - kernel support for non-hardware-specific features (e.g. ext3 FS)
  - a file system that presents a standard application environment
    (The configuration files that the libraries depend upon 
     e.g. a few files in /etc/*, a /dev/* that matches the hardware,
     a few misc. directories)
  - the application executable and libraries it links against
  - application-specific file I/O environment (usually /tmp/ and a
    few data directories)

You can detect the first and most of the second category at node boot 
time.  The kernel is loaded into memory and kernel modules are 
immediately linked in, so there isn't any reason to keep them around as a 
file system.

The third category does need to be a file system, but it's tiny and 
changes infrequently.  It can easily be provisioned, or even dynamically 
created, at node boot.
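
As a rough illustration of how small that third category can be, here is
a minimal Python sketch that creates such a tree from scratch.  The
directory list, the file contents and the function names are all
hypothetical choices made for this example, not the layout any particular
cluster system uses, and it does not populate /dev (that needs privileges
and the hardware detection described above).

```python
import os
import tempfile

# Hypothetical layout: these names and contents are illustrative only.
MINIMAL_DIRS = ["etc", "dev", "tmp", "proc", "lib", "var/run"]
MINIMAL_FILES = {
    "etc/passwd": "root:x:0:0:root:/:/bin/sh\n",
    "etc/nsswitch.conf": "passwd: files\nhosts: files dns\n",
    "etc/hosts": "127.0.0.1 localhost\n",
}


def build_minimal_root(root):
    """Create the small directory tree and config files under `root`."""
    for d in MINIMAL_DIRS:
        os.makedirs(os.path.join(root, d), exist_ok=True)
    for rel_path, content in MINIMAL_FILES.items():
        with open(os.path.join(root, rel_path), "w") as f:
            f.write(content)
    # tmp is conventionally world-writable with the sticky bit set.
    os.chmod(os.path.join(root, "tmp"), 0o1777)
    return root


def tree_size_bytes(root):
    """Total size of the regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_minimal_root(root)
        print("config tree occupies", tree_size_bytes(root), "bytes")
```

A real node would build this on a tmpfs or similar at boot; the point is
only that the tree is a handful of directories and a few short files.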

The fourth category is an interesting one.  You don't have to statically 
provision it at boot time, or mount a network file system.  When you issue 
a process to a node, the system that accepts the process can check that 
it has the needed executable and libraries.  Better, it can verify that it 
has the correct versions.  And this is the best time to check, because we 
can ask the sending machine for a current copy if we don't have the 
correct version.  By having a model for "execution correctness" we 
simultaneously eliminate one source of version skew and eliminate the need 
to pre-load executables and libraries that will be unused or updated 
before use.  Plus we automatically have a way to handle newly added 
applications, libraries and utilities without rebooting compute nodes.
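
The check-at-process-start idea can be sketched in a few lines of Python.
This is only a toy model under stated assumptions: the Master and
ComputeNode classes, the manifest format and the use of SHA-256 digests
to decide whether two versions match are all inventions for this example,
not a description of how any real system compares versions.

```python
import hashlib


def digest(data):
    """Content digest used here as the 'version' of a file (assumption)."""
    return hashlib.sha256(data).hexdigest()


class Master:
    """Stands in for the fully installed sending machine (hypothetical)."""

    def __init__(self, files):
        self.files = dict(files)  # path -> file contents as bytes

    def manifest(self, paths):
        """Describe the exact versions a process needs: path -> digest."""
        return {p: digest(self.files[p]) for p in paths}

    def fetch(self, path):
        return self.files[path]


class ComputeNode:
    """Starts empty and caches only what arriving processes require."""

    def __init__(self):
        self.cache = {}  # path -> file contents as bytes

    def accept_process(self, master, manifest):
        """Verify each needed file; fetch missing or stale ones.

        Returns the list of paths that had to be fetched.
        """
        fetched = []
        for path, wanted in manifest.items():
            have = self.cache.get(path)
            if have is None or digest(have) != wanted:
                self.cache[path] = master.fetch(path)
                fetched.append(path)
        return fetched


master = Master({"/usr/bin/app": b"app v1", "/lib/libfoo.so": b"libfoo v1"})
node = ComputeNode()
job = ["/usr/bin/app", "/lib/libfoo.so"]

first = node.accept_process(master, master.manifest(job))   # both fetched
second = node.accept_process(master, master.manifest(job))  # nothing fetched

# Update the library on the master only; the node notices at next start.
master.files["/lib/libfoo.so"] = b"libfoo v2"
third = node.accept_process(master, master.manifest(job))   # only the library
```

Note that the node was never rebooted or re-provisioned: the updated
library is picked up the next time a process that needs it arrives.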

> Assuming the actual OS image is the same in both cases,
> #2 clearly requires more memory than #1.

No, it can require substantially less.  It only requires more if you
assume the naive approach of building a giant RAMdisk with everything you
might need.  If you think of an alternative model where you are just
caching the elements needed to do a job, the memory usage is less.
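
To make the comparison concrete, here is a small Python sketch of the
arithmetic.  The file names and sizes are made up purely for illustration
(they are not measurements of any real image), and the two function names
are hypothetical.

```python
# Invented sizes in bytes for a pretend OS image; not real measurements.
IMAGE = {
    "/lib/libc.so": 2_000_000,
    "/lib/libm.so": 600_000,
    "/usr/bin/app": 5_000_000,
    "/usr/bin/emacs": 30_000_000,
    "/usr/share/doc.tar": 100_000_000,
}

# Files each job actually touches (also invented).
JOBS = [
    ["/usr/bin/app", "/lib/libc.so"],
    ["/usr/bin/app", "/lib/libc.so", "/lib/libm.so"],
]


def full_ramdisk_bytes(image):
    """Naive approach: everything you might need lives in RAM."""
    return sum(image.values())


def cached_bytes(image, jobs):
    """Caching approach: each file needed by any job is held once."""
    needed = set()
    for job in jobs:
        needed.update(job)
    return sum(image[path] for path in needed)


full = full_ramdisk_bytes(IMAGE)
cached = cached_bytes(IMAGE, JOBS)
```

With these invented numbers the cache holds three files while the full
RAM disk holds five; the gap grows with everything in the image that the
jobs never touch.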

Think of a compute node as part of a cluster, not a stand-alone machine.  
The only times that it is asked to do something new (boot, accept a new
process) it's communicating with a fully installed, up-to-date master 
node.  It has, at least temporarily, complete access to a reference 
install.  It can take that opportunity to cache or load elements that it 
doesn't have, or has only an obsolete version of.

There might be some dynamic elements needed later, e.g. name service 
look-ups, but these should be much smaller than the initial provisioning 
and the correctness/consistency model is inherently looser.

> Long ago not installing a local harddrive saved a considerable
> amount of money but this isn't true anymore. Systems that need
> to page (or swap) will require a harddrive anyway since paging
> over the network isn't fast enough so very few compute nodes
> will be running diskless.

The hardware cost of a local hard drive wasn't really an issue.  It has 
always been the least expensive I/O bandwidth available.  The real cost is 
installing, updating and backing up the drive.  If you design a cluster 
system that installs on a local disk, it's very difficult to adapt it to 
diskless blades.  If you design a system that is efficient without 
disks, it's trivial to optionally mount disks for caching, temporary files 
or application I/O.

> Approach #2 requires much less time when a node is installed,
> and a little less time when a node is booted.

We've been able to start diskless compute nodes in
  <BIOS memory count> + <PXE 2 seconds> + 750 milliseconds  (!)

To be fair, that was on blades with no disk controllers and only
Ethernet.  Scanning for local disks, especially with a SCSI layer, can
take many seconds.  Once you detect a disk it takes a bunch of slow seeks
to read the partition table and mount a modern file system (not EXT2).  
So trimming the system initialization time further isn't a priority until 
after the file system and IB init times are shortened.

-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA