disadvantages of a linux cluster

Wed Nov 6 13:33:06 PST 2002

On Wed, 6 Nov 2002, Paul Redfern wrote:

> On our first 256-processor Dell cluster, we worked with Intel who
> provided a special service that collected all machine errors, both
> hardware and software, every night. Intel collected the error logs and,
> at regular intervals, took the tags off them and sent them to MIT for
> independent analysis. The first four months (initial period of analysis)
> with Windows 2000 Advanced Server, MIT reported 99.9986% uptime.  Since
> then, the machine got hardened and reliability for it and our other
> clusters, has gotten better, not worse.  We've been operating Windows
> 2000 clusters since the server OS was first introduced. Typically
> outages are handled in less than ten minutes on one node with spare
> memory and hard drives. Outages don't affect the overall cluster; the
> scheduler works around it, and the cluster continues to run. The HPC

Right, this is what I meant.  Your 99.998** % uptime is an artifact.
Just rebooting each node one time over four months, for any reason
whatsoever, would produce less uptime than this.  
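
To put a number on that, here is a minimal sketch of the arithmetic. The function name downtime_budget_minutes is made up for illustration, and treating "four months" as 120 days is an assumption.

```c
#include <assert.h>

/* Minutes of downtime per node that a given uptime percentage allows
 * over a period of `days` days.  Hypothetical helper, for illustration. */
double downtime_budget_minutes(double uptime_percent, double days)
{
    assert(uptime_percent >= 0.0 && uptime_percent <= 100.0);
    assert(days >= 0.0);

    double period_minutes = days * 24.0 * 60.0;
    return period_minutes * (100.0 - uptime_percent) / 100.0;
}
```

Calling downtime_budget_minutes(99.9986, 120.0) gives roughly 2.4 minutes per node for the whole period, which is less than a single typical reboot.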

Also, you guys must be really really good at hardware -- it generally
takes me ten minutes (or more -- sometimes even a LOT more) to notice
that a node has died, ten more minutes to get to the node in question
assuming I'm actually in the building and not teaching at the time, ten
minutes to pull the node out of the rack and bench it -- and then I have
to figure out WHAT died on the node, which could be just about anything
and could take anything from ten more minutes to hours or days, since I
don't have spare parts handy for everything that could fail or an
(operationally wasted) hot spare, and lacking expensive bench tools the
only way to figure out WHICH part has failed is often to swap good parts
in until the bad one is identified, which can take hours of focused
effort.

Why, in the absolute best of all universes where I'm watching the logs
in real time as a disk starts to throw errors and I happen to be in the
room and shut it down immediately and have a spare disk handy it would
STILL take me more than ten minutes to unbutton the node, slap in the
disk, and button it up again, and then there is the time required to
reinstall the operating system on the node (another ten minutes) and the
final reboot into operation.  Why, I doubt that I could get a node back
up and online in less than an hour, assuming I was damn near sitting on
top of it UNLESS I kept a hot spare (which then needs to be factored
into your uptime as DOWN time).  So ten minutes of hardware downtime per
failure isn't good -- it is superhuman.

Then I also suppose that we have to assume that you have a hugish staff
of hardware elves on call who are instantly alerted when a disk fails or
a CPU smokes on one of the 256 systems in the middle of the night, so
that it is fixed with a TOTAL DOWNTIME of ten minutes.  Those of us in
the real world who don't have our own elves often just wait until the
next day (indeed, the truly lazy amongst us would likely not even notice
until after breakfast and a slow drive to work, followed by as long as
it takes to clear our email and actually check on the cluster:-), but of
course that would cost one 8-10 node-hours of downtime and a couple of
instances of >>that<< would ruin your claimed numbers.

Your numbers are puzzling to me on the basis of pure hardware failure
alone.  FWIW, we had a 16-node Dell PowerEdge 2300 cluster given to us
as part of an Intel equipment grant.  Within 18 months, I'd lost 16 ECC
DIMMs, three disks, a motherboard, two CPUs, and a NIC, each of which
crashed the (dual) node and rendered two CPUs inoperable in the cluster
until the node was fixed. This had nothing whatsoever to do with
operating system or task -- it was just "normal" hardware failure
(although the cluster duty cycle when up was well into the high .9's,
where I'm not talking about "uptime", I'm talking about duty cycle --
cycles consumed in useful tasks over cycles available).  Even allowing
for the hardware downtime we were in the high .9's overall out of all
cycles that COULD have been consumed in a perfect cluster, but each
downed hardware incident took close to a day (or even more) to resolve
because we had a service contract that specified mailback of parts
instead of hardware fairies on 24 hour call, and because we didn't have
a "hot spare's" worth of spare parts sitting around idle.  Still, a
couple of idle days per node over 18 months isn't half bad...especially
if reducing those two idle days still further would involve spending a
LOT more money on e.g. hardware contracts, hot spares, or expensive
software (like Windows:-).

I therefore would prefer some real world numbers:

  a) Duty cycle (cycles consumed/cycles available).

This is a direct measure of how loaded the cluster has been during the
time it is up (where "up" is defined to be "available for work", not
"booted but unavailable due to maintenance etc.").  Note that just
because a task is onboard does not mean that this is near unity --
scheduler efficiency and task density and a few other things come into
play here, as a "loaded" task can easily spend a lot of the time idle.
Linux has these numbers readily available or derivable from readily
available numbers in /proc -- I have no idea how they would be measured
(or if they CAN be measured) in Win2k.
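
As a rough sketch of how a per-node duty cycle could be derived on Linux, the code below samples the aggregate "cpu" line of /proc/stat twice and divides busy jiffies by total jiffies over the interval. The struct and function names are hypothetical. Counting iowait as idle time and counting system, irq, softirq, and steal time as busy are assumptions about what "useful" means, and a cluster-wide figure would still require summing over nodes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Jiffy counters taken from the aggregate "cpu" line of /proc/stat. */
struct cpu_sample {
    unsigned long long busy;   /* user + nice + system + irq + softirq + steal */
    unsigned long long total;  /* busy + idle + iowait */
};

/* Parse one aggregate "cpu ..." line.  Returns 0 on success, -1 otherwise.
 * Older kernels report only user, nice, system, idle; any fields that are
 * missing are treated as zero.  Per-CPU lines ("cpu0 ...") are rejected. */
int parse_cpu_line(const char *line, struct cpu_sample *out)
{
    unsigned long long v[8] = {0};

    assert(line != NULL && out != NULL);
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    int n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    if (n < 4)
        return -1;
    out->busy = v[0] + v[1] + v[2] + v[5] + v[6] + v[7];
    out->total = out->busy + v[3] + v[4];
    return 0;
}

/* Read the first line of /proc/stat into *out.  Returns 0 on success. */
int read_cpu_sample(struct cpu_sample *out)
{
    char line[256];
    FILE *fp = fopen("/proc/stat", "r");

    if (fp == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, fp);
    fclose(fp);
    return ok ? parse_cpu_line(line, out) : -1;
}

/* Fraction of available cycles consumed between two samples,
 * or -1.0 if the samples are out of order or identical. */
double duty_cycle(const struct cpu_sample *before,
                  const struct cpu_sample *after)
{
    if (after->total <= before->total || after->busy < before->busy)
        return -1.0;
    return (double)(after->busy - before->busy) /
           (double)(after->total - before->total);
}
```

Typical use would be to call read_cpu_sample() at the start and end of an accounting interval and pass both samples to duty_cycle(). Note that the counters restart at zero on reboot, so samples taken across a reboot cannot be compared directly.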

A more fine-grained measure would also involve the KIND of load(s) --
how much network traffic, disk I/O, CPU bound and memory bound work is
done.  Probability of software and hardware failure of most of these
subsystems is at least weakly dependent on load and heat (and physical
wear in the case of disks).  Some tasks stress parts of a kernel that
others never touch -- doing a lot of malloc/free (so that a leaky OS is
a problem), for example.  A system that might survive three months running

/* Pure CPU loop: makes no system calls once it is running. */
int main(void)
{
 double a;
 while(1) { a = 1.0; }
}

might crash like crazy or get weird on you in other ways on

#include <stdlib.h>

/* Allocate and free room for 1M doubles, over and over. */
int main(void)
{
 double *a;
 while(1) { a = malloc(1048576*sizeof(double)); free(a); }
}

Task matters.

  b) Uptime, measured as (total time systems are booted into the OS and
available for numerical tasks/total amount of time ALL systems have been
around).

This means that if you have 9 systems booted and a hot spare, the best
you can count for uptime is 90%.  It also means that if a system crashes
in the middle of the night and you don't get around to fixing it until
the next day, you lose eight or twelve hours, not the ten minutes it
eventually takes you to fix it after discovering the crash, pulling the
system and diagnosing the problem. Finally, it means that if a system is
sitting idle as a "spare", is rebooted, is taken partly "down" to do
maintenance, or reinstalled, is upgraded -- if it isn't available to do
work for ANY REASON, it isn't "up".  This latter part has nothing to do
with "errors".  I might reboot a system known to leak memory like a
sieve once a day and never encounter an error BECAUSE I rebooted it once
a day before a leak could cause a crash or visible error, but it would
hardly be fair to claim that the system was available during the reboot
and base "uptime" claims on time it stayed up after this spurious boot
until the next spurious boot.
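
A minimal sketch of uptime under this definition follows. The function name fleet_uptime and the array-of-hours representation are assumptions made for illustration. A hot spare is entered as unavailable for the whole period.

```c
#include <assert.h>
#include <stddef.h>

/* Uptime in the sense of definition (b): node-hours available for work
 * divided by node-hours owned.  unavailable_hours[i] is the total time
 * system i could not run jobs for ANY reason (crash, reboot, reinstall,
 * waiting for repair, or sitting idle as a spare). */
double fleet_uptime(size_t n_systems, double period_hours,
                    const double *unavailable_hours)
{
    assert(n_systems > 0 && period_hours > 0.0);

    double owned = (double)n_systems * period_hours;
    double lost = 0.0;
    for (size_t i = 0; i < n_systems; i++) {
        assert(unavailable_hours[i] >= 0.0 &&
               unavailable_hours[i] <= period_hours);
        lost += unavailable_hours[i];
    }
    return (owned - lost) / owned;
}
```

With ten systems, one of them a spare that is unavailable for the entire period, this returns 0.9.  With 256 systems over 120 days (2880 hours, again assuming that is what "four months" means), a single node left down for ten hours overnight already brings the result to about 99.9986%.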

  c) Continuing this, some measure of average CONTINUOUS uptime.  For
example, of your 256 nodes, how many were never rebooted over four
months? Presumably almost all of them, of course, since a single 5
minute reboot each would trash your uptime claim all by itself.

It might also be nice to have some indication of how many of the crashes
that did occur involved hardware, and how many software (or at least
were "fixed" with a reboot, not a hardware replacement).

Of course, I can't GET real world numbers from you to compare to our own
potential experience if your cluster comes with hardware elves, hot and
cold running Intel and Dell, and so forth.  Real hardware failures
involving dead processors, motherboards, power supplies, disks, memory
require more than fourteen minutes per system per million minutes (about
two years) to deal with.  On a good decade and with high quality
hardware you might average one failure per two years, but only in an
environment that has hot and cold running humans whose time is MUCH more
costly than the benefit of a day or so of downtime is it possible to
limit the time cost of that failure to fourteen minutes, including the
reboot.  Unless you maintain hot spares, of course, but then they need
to be included in your "uptime" as purchased but unusable systems.

It is also important to differentiate between hardware problems and
OS-related crashes, as they are the RELEVANT difference between your
Microsoft cluster and a linux cluster.  Nobody on this list is likely to
be overwhelmed by a near-complete lack of software crashes over four
months.  That has been the rule, not the exception, for well-configured
linux systems for years now, with uptimes (times a system is
continuously available for work) of a year or more not uncommon.  I'm
more than just impressed, I'm totally amazed, by your claims from the
point of view of hardware reliability alone.

> manager at Microsoft recently made available special pricing/package for
> HPC clusters through the OEM channel (such as Dell, HP, IBM, Hitachi)
> that closes the gap in pricing between windows-based and linux-based
> bundles. I'd be happy to introduce you to the appropriate people at
> Microsoft or Dell.

Closes the gap?  Microsoft is giving everything away for free in open
source, GPL (or equivalent) form?  That's kind of them...;-)

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu