disadvantages of a linux cluster

Robert G. Brown rgb at phy.duke.edu
Sat Nov 9 16:21:32 PST 2002

On Wed, 6 Nov 2002, Jim Lux wrote:

> >
> >   b) Uptime, measured as (total time systems are booted into the OS and
> >available for numerical tasks/total amount of time ALL systems have been
> >around).
> >
> >This means that if you have 9 systems booted and a hot spare, the best
> >you can count for uptime is 90%.  It also means that if a system crashes
> >in the middle of the night and you don't get around to fixing it until
> >the next day, you lose eight or twelve hours, not the ten minutes it
> >eventually takes you to fix it after discovering the crash, pulling the
> If the cluster were claimed to have 9 processors worth of processing 
> capability, and the OS and scheduler allow transparent use of the hot 
> spare, then, you could get 100% uptime as long as you only had 1 failure.

Well sure, but then I can claim that my ganesh cluster (which currently
has only 13 nodes out of 16 running) is really only a "12 node cluster".
My OS and scheduler (the latter being "me":-) not only allow transparent
use of the hot spares, they allow transparent use of the hot spares even
when one of the 12 nodes hasn't died yet and I get to count that time up
against eventual node death.  So now I can laugh at mere nines of uptime
-- I'm well over 100%!  Well over 100% cumulative duty cycle, too.
Hooray!  I'm now even more efficient than CTC!

Of course this isn't correct: I really have 16 nodes, and I think keeping
hot spares sitting around idle is silly.  I could choose to leave one,
two, or even four idle but configured and call them "hot spares" to pump
up my "uptime", if my goal were to show that I can keep at least 12 nodes
running out of a pool of 16 or to be able to issue a nifty press release
about "99.99846% uptime".  For the purpose of getting work done, though,
a moment of reflection will surely convince you that this is a cosmically
silly and somewhat dishonest thing to do:-).

For people interested in getting work done, the ONLY THING THAT MATTERS
is the aggregate work accomplished during the useful lifetime of the
cluster, which (as has been discussed repeatedly on this list) is
somewhere in the ballpark of two or three years.  (Some claim only one
year and can even back up the claim with some real numbers; some -- like
me -- use nodes for as long as five years anyway because any CPU that
ain't dead yet can still contribute cycles, and there are all sorts of
opportunity cost and infrastructure nonlinearities that make simple
answers for the ideal optimax wrong;-)

If one has 10 nodes and deliberately leaves one node idle, the MOST work
one can get done is 90% of the work one could have gotten done with all
10 nodes cranking away all of the time.  Sure, you can lose any one node
and not get any WORSE, but you achieve this at the expense of basically
being bad all the time.  It can be sliced and diced any way one wants,
but the bottom line is that one has wasted 10% of the resource even
BEFORE a failure occurred, and one will never, ever, recover the work
that could have been done by the deliberately idled node.  The real
failure is in the brain, which left a valuable system, already paid for,
doing no work while its useful lifetime and time under warranty
frittered away.
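
To put rough numbers on that, here is a minimal Python sketch.  The
three-year lifetime comes from the ballpark above; the failure count, the
12-hour repair delay, and the helper name `node_hours` are assumptions
made up purely for illustration, and the model is deliberately generous
to the hot spare (it swaps in instantly and covers every failure).

```python
HOURS_PER_YEAR = 365 * 24

def node_hours(total_nodes, idle_spares, lifetime_hours,
               failures=0, repair_delay_hours=0):
    """Aggregate node-hours of work delivered over the cluster lifetime.

    Illustrative model only: with no spare, each failure costs
    repair_delay_hours of lost work; with at least one idle spare,
    failures are assumed to cost nothing at all.
    """
    working = total_nodes - idle_spares
    lost = 0 if idle_spares > 0 else failures * repair_delay_hours
    return working * lifetime_hours - lost

lifetime = 3 * HOURS_PER_YEAR  # three-year useful life

# Run all 10 nodes; assume 5 failures, each noticed and fixed 12 hours later.
all_running = node_hours(10, 0, lifetime, failures=5, repair_delay_hours=12)

# Run 9 nodes and keep 1 idle as a hot spare; same 5 failures, zero loss.
with_spare = node_hours(10, 1, lifetime, failures=5, repair_delay_hours=12)

print(all_running, with_spare, with_spare / all_running)
```

With these assumed numbers the all-nodes cluster loses 60 node-hours to
failures and still delivers about 11% more work.  The idle spare only
breaks even if failures cost a full node-lifetime (26280 node-hours) of
downtime.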

For this reason, in my opinion, one counts hot spares as a dead loss
FROM THE BEGINNING in any fair (that is, not deliberately stupid)
assessment of cluster "uptime".  One does not get to "pad" one's uptime
just because one can quickly insert a node one has paid for but is
leaving idle (that is to say, DOWN) unless/until one can show that there
exists ANY circumstance where one is likely to get more net work done,
per dollar spent, that way instead of just artificially bumping some
otherwise irrelevant numbers.
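
As a sketch of that accounting rule in Python (the function name `uptime`
and the idea of treating the 13-of-16 snapshot as if it held for a whole
interval are assumptions for illustration only):

```python
def uptime(booted_node_hours, nodes_counted, interval_hours):
    """Booted-and-available node-hours over node-hours of the nodes counted."""
    return booted_node_hours / (nodes_counted * interval_hours)

interval = 1000  # arbitrary interval length in hours (hypothetical)

# 9 nodes booted the whole time plus 1 idle hot spare, all 10 counted.
honest_spare = uptime(9 * interval, 10, interval)

# 13 of 16 nodes running, counted against all 16 nodes owned.
honest_ganesh = uptime(13 * interval, 16, interval)

# The same 13 running nodes, "padded" by claiming a 12-node cluster.
padded_ganesh = uptime(13 * interval, 12, interval)

print(honest_spare, honest_ganesh, padded_ganesh)
```

Counting every node owned gives 0.9 and 0.8125; shrinking the denominator
to 12 pushes the same cluster past 1.0, which is the whole trick.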

CBA (cost-benefit analysis), CBA, CBA, with a clear statement of one's
work goals and one's total means to accomplish them.  That's the only
way to do fair comparisons.  Otherwise we might as well all buy Crays,
because they are big, expensive, and come with hot and cold hardware
elves.  Well, maybe we might as well NOT all buy Crays because most of
us just plain can't afford them!

I think that one can make a strong economic case for NEVER purchasing a
service contract for a cluster, and NEVER buying and holding idle spare
parts (beyond, perhaps, a hard disk and DIMM or two), and for (in fact)
having the cluster "eat its own dead" -- using dead nodes to repair
nodes as they die and gradually permitting the number of nodes to
shrink.  This is because of Moore's Law, which rather brutally punishes
node repair compared to purchasing new nodes pretty much anytime after
the typical one year warranty of a node expires.

This is probably a bit too extreme for true optimax behavior --
replacing a memory DIMM or a power supply or a hard drive for order of
$100 or less to get another year's use out of a node is almost certainly
worth it in the first or second year of a node's existence, but maybe by
the third and certainly by the fourth it is a waste of time and money --
you're better off putting the $100 into the kitty for a new node that is
likely 8x as powerful as the node you're replacing.
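
A rough way to check that intuition in Python.  Everything numeric here
except the $100 repair and the one extra year of use is an assumption:
the $1500 node price, the three-year life of a new node, and the 18-month
doubling time are placeholders, and `speedup` and `dollars_per_work` are
hypothetical helpers.  Work is measured in old-node-years.

```python
def speedup(node_age_years, doubling_years=1.5):
    """How much faster a new node is than one bought node_age_years ago."""
    return 2.0 ** (node_age_years / doubling_years)

def dollars_per_work(cost, relative_speed, years_of_use):
    """Dollars spent per old-node-year of work obtained."""
    return cost / (relative_speed * years_of_use)

REPAIR_COST = 100.0      # from the text: a DIMM, power supply, or disk
NEW_NODE_COST = 1500.0   # assumed price of a new node
NEW_NODE_LIFE = 3.0      # assumed useful life of a new node, in years

repair = dollars_per_work(REPAIR_COST, 1.0, 1.0)  # one more year of the old node

for age in (1, 2, 3, 4):
    replace = dollars_per_work(NEW_NODE_COST, speedup(age), NEW_NODE_LIFE)
    better = "repair" if repair < replace else "replace"
    print(f"node age {age}y: repair ${repair:.0f}, replace ${replace:.0f} -> {better}")
```

Under these assumed numbers repair wins through year three and
replacement wins in year four; a cheaper node or a shorter doubling time
moves the crossover earlier.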

Now, before anyone brings it up, I will cheerfully admit that there are
SOME cases of parallel computation that really might get unhappy if the
cluster is supposed to have "16" nodes and one dies and the number
available goes to "15" (or 256 nodes goes down to 255 or whatever).
Those cases are rare, of course, and invariably are fine grained
synchronous cases where the computation itself globally fails when any
node goes down (so "failover" mechanisms other than checkpointing of the
code itself are a waste of time) but they exist.

EVEN THEN one would have to do the CBA to convince me that hot spares
are a net-productivity-increasing investment, but at least then I'd be
willing to accept the possibility, especially if the computation was
using all (say) 256 nodes and couldn't be restarted from its last
checkpoint until there were 256 nodes to run on again.  In all other
cases, especially in the typical coarse grained or embarrassingly
parallel applications or applications that don't use all N nodes of the
cluster anyway, I'd just have to say "Is that a Sears poncho or a real
poncho?  Hmmm.  No fooling."


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
