Reliability analysis (was RE: Windows HPC @ Cornell)

Robert G. Brown rgb at
Thu Nov 7 14:18:51 PST 2002

On Wed, 6 Nov 2002, Jim Lux wrote:

> I would think that a reasonably rigorous analysis would need to address 
> things like (re)boot time, mean time to repair, the difference between 
> "operating system up and ready" and "actually running user code", and so 
> forth. Maybe a good start would be to establish a common terminology for 
> things, and then we can argue/discuss how to boil down 
> measurements/predictions to a single "figure of merit".
> RGB.. maybe another chapter for your book?

Works for me, provided that there isn't already a document out there
that specifies reasonable measures for all of this.  In which case we
should steal it rather than reinvent wheels.

The only real problems I see are that all of these "times" are
irrelevant outside of a full CBA of the operation.  If I'm willing to
pay for one person per eight nodes to care for a cluster, and to keep
one hot spare node turned on but unused 24x7, and keep at least one
human PRESENT per 24 nodes on a 24 hour basis, in a room containing LOTS
of nodes (and hence lots of fulltime humans just waiting for problems),
and install sensitive tools to detect a failure and initiate recovery,
and get to pick high quality hardware, all on UPS/conditioners, etc. I
might be able to claim four nines of "uptime" -- at a huge price.

Without a handy dandy 100 KW auxiliary generator on the premises I still
misdoubt five nines and up on anything like yearlong intervals, as
you're pushing against the reliability of the electrical grid in a lot
of places at that point (where I live in NC we will almost certainly
have at least one hourlong power outage a year, with some exceptional
years where hurricanes, ice storms, or other weather take down power for
as long as weeks).  Then there is required maintenance on whatever
chiller/AC you have -- cleaning of filters, de-icing of exchangers, and
plain old failure.  Industrial HA sites might well invest the money
required to achieve this sort of reliability for core operations because
of the huge cost of downtime.  Very few HPC computations are so
CBA-sensitive to downtime for this to be worth it, and even when it is
it is likely to be simpler (sorry, "cheaper") to engineer a reasonable
degree of fault tolerance into the parallel code than to create a
hardware-electrical-grid-AC-failure-proof cluster.

Of course, one can always achieve as many nines as you like between the
accidents that shut everything down for reasons beyond your control.

On the other hand, that SAME cluster (high quality hardware and so
forth) run by just one person simply cannot achieve the same degree of
reliability, presuming that that one person wants to occasionally do
important things like eat, sleep, and take care of essential bodily
functions.

A cluster with the same number of nodes of CHEAP hardware, no UPS etc
but WITH the hot and cold running systems persons and hot spares might
still get really good "uptime" numbers (if their local power grid is
reasonably reliable).

A cluster with the same number of nodes, cheap hardware, a single
systems person, no UPS, in an inadequately cooled former closet, might
get really bad ones.

So while it is perfectly reasonable to come up with some formal
metrics, it is equally important to record the environmental and cost
factors that bear just as heavily on any meaningful comparison.

One set of metrics that I use here is incorporated into a wulfstat
display:

   name |      CPU Model Name     |ClkM|CchK|    Time    |     Uptime     |Duty%
 ganesh |AMD Athlon(tm) Processor |1333| 256| 4:41:38 pm | 36d:5h:42m:56s |  94
    g01 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm | 30d:6h:48m:42s | 100
    g02 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm |  30d:7h:18m:4s | 100
    g03 |                         |    |    |            |           down |
    g04 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm |  30d:6h:2m:23s | 100
    g05 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm | 30d:7h:17m:35s | 100
    g06 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm | 30d:7h:16m:48s | 100
    g07 |                         |    |    |            |           down |
    g08 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm |  30d:7h:9m:37s | 100
    g09 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm |   30d:7h:9m:4s | 100
    g10 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm |  30d:6h:7m:38s | 100
    g11 |AMD Athlon(tm) Processor |1328| 256| 4:41:39 pm |  30d:6h:8m:16s | 100
    g12 |AMD Athlon(tm) Processor |1328| 256| 4:41:38 pm |   30d:6h:7m:0s | 100
    g13 |AMD Athlon(tm) Processor |1328| 256| 4:41:39 pm | 30d:6h:10m:36s | 100

where at a glance you can see that in this fragment of the "ganesh"
cluster the server has been up 36 days (since its 7.3 upgrade,
basically) and the nodes have been up 30 (ditto), and that I got the
nodes loaded with tasks almost immediately, so that their duty cycle
(computed from /proc/uptime as

 100.0*(hostptr->val.uptime_up - hostptr->val.uptime_idle)
     / hostptr->val.uptime_up

) is basically 100%.

You can also see two nodes down (crashed disks, waiting for me to have
time to go mess with fixing them).  On the one hand, very high OS-level
efficiency and very high utilization; on the other hand, I teach, do
research, take care of my own cluster, write stuff like wulfstat, and
have a wife and kids, and the nodes that broke are out of warranty.
There is a nontrivial amount of University/grant paper to be pushed just
to be permitted to BUY the replacement drives required, and when I've
found time to do that, I still have to do the actual buying.  The
physical act of de-shelving these (minitower) nodes and replacing the
disk is only about ten or twenty minutes of work at that point.

However, if I had bought expensive hardware (these are utterly generic
nodes), expensive service contracts, or (equivalently) a hot spare or
two, I would have had fewer nodes in the first place, running over a VERY
long time (all the nodes ran continuously for close to a year without a
single failure before I started to lose disks).  My net productivity
over fewer nodes would probably be considerably lower, overall, than it
is enduring my own overprogramming/laziness and leaving nodes down for a
week or two until I have time and energy to fix them.

Just a concrete illustration of how difficult it is to talk about uptime
OR duty cycle without an accompanying CBA.  The real measure of cluster
success is:

  How much of YOUR work are you able to get done (per unit time) for
  YOUR dollar investment?

In some cases you will get more work done with fewer, more expensive,
but more reliable nodes.  In some cases (like mine) you'll get more
work done with more, cheaper, but less reliable hardware (that still
proves to be pretty reliable, overall).  I don't build clusters to
achieve four or
five nines "uptime" -- I build them to get the most work done for my
dollar.  Unless demonstrating "reliability" is itself the work goal, it
is incredibly stupid to >>over<<engineer for high reliability at the
expense of getting work done.

It is this that makes cluster engineering so interesting.  There is a
dazzling range of options and trade-offs to consider -- speed of CPU,
architecture of CPU, amount and kind of memory, network, racks vs towers
and shelving, operating system, compilers, parallel libraries and tools,
numerical packages.  Spend more on one at the expense of what you can
spend on the rest.  To achieve the "perfect balance" of all of this is
not easy, especially with the most important trade-off being your own
time or the amount of human time you're willing to pay for or spend to
run the cluster.  In fact, it is generally a bit of a gamble -- you
>bet< that your choices for hardware and management will provide optimal
yield; it is very difficult to ever verify that you made the "best"
choice, only that your choice was or wasn't good enough.


Robert G. Brown	             
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at
