[Beowulf] Should I go for diskless or not?

Thu May 14 14:51:24 PDT 2009

Dr Cool Santa wrote:
> I have a cluster of identical computers. We are planning to add more nodes
> later. I was thinking whether I should go the diskless nodes way or not?
> Diskless nodes seems as a really exciting, interesting and good option,
> however when I did it I needed to troubleshoot a lot. I did fix it up, but I
> had to redo the filesystem, but the past experiences didn't make much of a
> difference. I still need to fix up everything, I kinda need your help to
> decide.
> Also, performance wise, I was thinking that diskless is not a good option,
> and since performance matters . . .
> Can somebody outline the pros and cons of each or just give me thier
> opinion.
> 

I was in a very similar position and I ended up with disks, but the entire OS
and boot process was diskless.  It was nice that if a disk died I could just
reboot and the only difference was that there wasn't local scratch and swap
space available.

I'd ballpark the costs involved in having disks as somewhere around:
* 1 man week (unless you switch to a cluster environment that does this out of
  the box)
* $35 per node (although in the real world it seems vendors really want
  to put a disk in each box, so the cost can be as low as zero)
* a few watts per node
* a failure rate of somewhere around 1% per year; each failure is maybe
  1 hour and $35 to replace yourself, or maybe 2 hours and $0 if you have to
  call support, run some stupid diagnostic program, get an RMA, print a
  label, box it up, drop it off with the local shipper, wait a week or two,
  unbox, and install.
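
To put rough numbers on that list, here is a small Python sketch of the
arithmetic.  The $35 disk, 1% per year failure rate, and 1 hour per swap are
the ballpark figures above; the function name and the 100 node example size
are assumptions made up for illustration, and power draw is left out.

```python
def disk_cost_estimate(nodes, disk_price=35.0, annual_failure_rate=0.01,
                       hours_per_replacement=1.0):
    """Rough cost of having one local disk per node.

    Defaults are the ballpark figures from this post, not measurements.
    """
    purchase = nodes * disk_price                    # one-time cost
    failures_per_year = nodes * annual_failure_rate  # expected value only
    return {
        "purchase_dollars": purchase,
        "expected_failures_per_year": failures_per_year,
        "replacement_dollars_per_year": failures_per_year * disk_price,
        "admin_hours_per_year": failures_per_year * hours_per_replacement,
    }


if __name__ == "__main__":
    # 100 nodes is a made-up cluster size, not a number from this thread.
    for key, value in disk_cost_estimate(100).items():
        print(f"{key}: {value:g}")
```

With those assumptions, 100 nodes comes to about $3500 up front plus roughly
one dead disk, $35, and an hour of admin time per year.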

The down side of diskless is:
* Somewhat lower reliability (at least if you do it yourself).  Has this
  changed?  Is the network block driver stable?  Is running with no swap,
  or swapping over the network, stable?
* You give up a local disk's 60-120MB/sec of bandwidth (for sequential
  reads/writes) or 100-200 random disk ops per second, which is quite handy
  if you actually use it.
* When I tried it the kernel wasn't really guaranteed to work with remote
  swap.  There was a network block layer that looked rather immature and
  claimed it would avoid the "I need to allocate a buffer so I can talk to
  the network so I can swap and have more memory" problem.  Swapping over
  the network might well require a custom kernel compile.

The good side of diskless:
* Boots are faster (usually), with no fsck.  My boot up process was pretty
  much PXE boot, linux kernel+initrd over the wire, pivot root over to NFS,
  launch sshd and the batch queue.  I believe this took around 11 seconds
  after the PXE transfer was done.
* no spinning disk (no disk failures and a few watts less heat)

I'll respond to a few other comments I saw on this thread:

Jan - " - avoids network traffic (no NFS-Root, no /usr-mounts over NFS or such
stuff... )", technically true.  But in the HPC use I've seen this is
approximately zero.  I'd be shocked if, in any normal HPC/cluster use, /usr
reads and writes were even 1% of the traffic... unless the network was mostly
idle, and if it's idle who cares?

Brian Ropers-Huilman - "The only performance issues are that you consume
memory for the OS, which takes it away from applications."  I don't see how
this is true.  It's not like you have to carve out 500MB for a big ram disk.
You still have a buffer cache that would be no bigger and no smaller than a
node with a disk.  In both diskfull and diskless you can (and often do) run
NFS (or some other network filesystem); the main difference is that instead
of talking to a SATA drive to read/write blocks locally you talk to the NFS
driver and read/write over the network.

Doug:  Diskless provisioning is usually easier to manage.

Hmm, not sure I buy that one. Pretty much any decent cluster distribution should:
* allow you to add a compute node without much more than plugging it in
  and telling it to PXE boot (diskless or diskfull)
* allow you to push a configuration cluster-wide
* allow you to reinstall/reboot all nodes.

Sure, installing 100 diskfull nodes takes more network bandwidth than booting
100 diskless nodes.  The flip side is that booting 100 diskless nodes takes
more bandwidth than booting 100 disk nodes.  For practical uses of clusters,
either way booting/installation is approximately 0% of the annual network
bandwidth.  Certainly installing 1000 nodes from a single fileserver would
take quite a while, but various technologies (BitTorrent and broadcast)
remove that bottleneck.  I'm kinda curious: since diskless nodes don't really
install, how do they handle heterogeneous hardware?  Say a dead motherboard
comes back with a new pci-id or 3?
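
A back-of-the-envelope check of the "approximately 0%" claim, sketched in
Python.  Every number in it (image sizes, how often nodes reboot or get
reinstalled, a 1 Gbit/s fileserver link) is an assumption made up for
illustration, so plug in your own.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def provisioning_fraction(nodes, bytes_per_event, events_per_year,
                          server_link_bytes_per_sec):
    """Fraction of the fileserver link's yearly capacity spent on
    booting or installing nodes.

    Capacity is the link running flat out all year, so this is only a
    rough upper-bound style comparison.
    """
    provisioning_bytes = nodes * bytes_per_event * events_per_year
    capacity_bytes = server_link_bytes_per_sec * SECONDS_PER_YEAR
    return provisioning_bytes / capacity_bytes


GIGABIT = 125e6  # bytes/sec on an assumed 1 Gbit/s fileserver link

# Assumed: each diskless boot pulls 200 MB, and nodes reboot monthly.
diskless = provisioning_fraction(100, 200e6, 12, GIGABIT)
# Assumed: each diskfull install pulls 2 GB, twice a year.
diskfull = provisioning_fraction(100, 2e9, 2, GIGABIT)

if __name__ == "__main__":
    print(f"diskless boots:    {diskless:.4%} of yearly link capacity")
    print(f"diskfull installs: {diskfull:.4%} of yearly link capacity")
```

Under these made-up numbers both cases come out well under a tenth of a
percent of what the fileserver link could move in a year.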

Doug:  In general diskless is faster.

At what?  Diskless booting is faster than diskfull booting or diskfull
installation.  Application performance should be pretty much the same.  The
worst case scenarios, like installing 1000 compute nodes from the head node,
are usually dealt with by using broadcasts or BitTorrent.

So it depends.  My current thinking is that it's not worth the man hours to
do it yourself unless you have a larger cluster.  If it's supported by your
cluster distribution it could easily be worth it.  In the grand scheme of
things I'd worry about diskless last.  Decide on your cluster distribution
based on your application and user needs, systems administrator experience,
and budget.