[Beowulf] Re: Cooling vs HW replacement

Thu Jan 27 11:07:51 PST 2005

On Thu, 27 Jan 2005, Josip Loncaric wrote:

> Karen Shaeffer wrote:
> > [...]
> > 
> > If DDMs were interested in helping customers discriminate based on the
> > actual expected lifetime of drives, they would all publish running infant
> > mortality rates, updated weekly, during the production run of their disk
> > drives. After all, this is the one metric the entire organization is focused
> > on during production. But, what they hand out is this MTBF number to
> > prospective customers. A number they pay no attention to internally.
> 
> Karen's excellent introduction to the logic of disk drive manufacturing 
> (DDM) is well worth reading -- particularly since the same factors drive 
> other computer manufacturers: rapid product cycles, insane time 
> pressures, thin profit margins, limited opportunity to prevent 
> financially ruinous mistakes, etc.

Agreed.  Been there, been burned (which is why I keep urging caution
about using published numbers from the mfr as a sound basis for
engineering without a grain of salt, especially to people with
relatively little experience in this arena -- cluster newbies).

> Therefore, deciding which drive model (or other component) to use fits 
> under the topic of optimal decision making under uncertainty -- which is 
> a standard part of game theory, often used in operations research, etc.
> 
> Making rational choices, which can withstand scrutiny even when things 
> unexpectedly go wrong, is not just an art.  There is theory to build on.

This is also an excellent contribution to the discussion.  In
particular, I'd urge people building large clusters to consider the
benefits of insuring some of the risks, which is what humans generally
do when confronted with the same problem in the arena of human affairs.
In a large cluster, the economic consequences of a massive component
failure (however common or rare that might be) can be devastating to the
project, to careers, to productivity.  This is a classic component of
game theory applied to real life and is the fundamental raison d'etre
for the insurance industry (and why I keep referring to actuarial data).

One piece of "insurance" is obviously the base warranty of each
component, but this generally protects you only partially from the
actual cost of the replacement hardware itself if a component fails. You
still take a major hit in productivity and diversion of opportunity cost
labor associated with downtime and repair.  Whether or not this
additional cost is affordable depends to a certain extent on luck, to a
certain extent on the "value" of your project.  Speaking from bitter
personal experience with the Tyan 2460 and 2566 motherboards (as well as
anecdotal experiences with various other system components such as
drives, riser cards, cases, case fans, CPUs and CPU fans, OEM AMD Athlon
MP fans in particular), things DO break in mass "catastrophic" bursts a
lot more often than MTBF numbers or even warranties would lead you to
expect.  This cost can be quite high.  It can drain resources and
energy for years (until the hardware is finally aged out and replaced),
or require an immediate infusion of much money for immediate
replacements, or in our case (where the replacements were themselves a
problem, albeit a lesser one), both.

Practically speaking, hardware "insurance" often means considering
extended and/or onsite warranties -- effectively betting someone some
percentage of your systems' original cost (generally ballpark 10% for 3
years) that those systems will break.  Extended service has two valuable
purposes.  The first is that it directly protects you from bearing
the brunt of the cost of anything from the normal patter of bathtub-bottom
failures during the normal lifespan up to mass failures or higher than
expected normal-lifespan failures during the period that the cluster is
expected to be productive.  Other forms of insurance against catastrophic
failure (such as fire or theft insurance and surge protection and door
locks) exist as well, although they tend to be purchased outside of the
engineering/operations loop.

Insurance via extended warranty addresses the paradox of mass failure
(one that might "kill" you or at any rate your project).  Even though it
is often (or even generally) cheaper in terms of expectation value of
the total cost to build a DIY cluster and self insure, excessive
(unlucky) failures are far more likely to be "fatal".  One major
complaint against the HMO industry is that capitation (giving a
physician X dollars per head for a group of patients up front while
obligating the physician to treat all of that group who get sick)
exposes those physicians to the risk of catastrophe in the event
that a plague comes along and strikes the group.  It is anti-insurance
(the passing of risks back to small groups for which the fluctuations
can be fatal rather than assuming the risks spread out over a large
group where they are more predictable).
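
To put a rough number on "more predictable": the sketch below compares the
relative spread in the number of failures for a small pool and a large one.
It assumes every unit fails independently with the same probability, which
is an idealization (the mass bursts described above are exactly the case
where it breaks down), and the pool sizes and the 5% failure probability
are illustrative values, not measured data.

```python
import math

def relative_spread(n_units, p_fail):
    """Standard deviation of the failure count divided by its mean, for
    n_units independent units that each fail with probability p_fail
    (binomial model; independence is an assumption)."""
    mean = n_units * p_fail
    std = math.sqrt(n_units * p_fail * (1.0 - p_fail))
    return std / mean

# Illustrative numbers only: a 100-node cluster versus a 100,000-unit pool.
small_pool = relative_spread(100, 0.05)
large_pool = relative_spread(100_000, 0.05)

print(f"100 units:     spread is {small_pool:.1%} of the expected failures")
print(f"100,000 units: spread is {large_pool:.1%} of the expected failures")
```

Under these assumptions the spread shrinks like one over the square root of
the pool size, so the large pool can budget for failures far more tightly
than the small one can.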

It costs you a bit more to insure with a larger group (even with its
more predictable risks), but the benefit you gain is that you'll stay
"alive" no matter what if you can afford the insurance itself in the
first place.  Practically speaking, since additional cost means fewer
nodes, you can choose to definitely get 10% less work done over the
lifetime of your project but ensure that you have a very small chance of
getting only 50% or 30% of the work done (or face massive out of pocket
costs) due to catastrophic failure downtime.  If things go well of
course you lose -- maybe over the same interval you only lose 5% of your
nodes -- and MOST of the time, one expects things to go well, which
encourages people to assume the risk and gamble that things will go
well.
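
A back-of-the-envelope version of this tradeoff, as a sketch only: the 10%
premium, the 5% node loss in a good run, and the 30% outcome in a bad run
are the figures used above, while the probability of a catastrophic run
(p_catastrophe) is a made-up input, and the insured case is simplified to
a guaranteed 90% of the work.

```python
INSURED_FRACTION = 0.90  # 10% fewer nodes, treated here as a certain outcome

def self_insured_outcome(p_catastrophe, good_fraction=0.95, bad_fraction=0.30):
    """Return (expected, worst-case) fraction of work done when self-insuring.

    good_fraction: work done when only ~5% of nodes are lost
    bad_fraction:  work done after a catastrophic mass failure
    p_catastrophe: hypothetical probability of the catastrophic case
    """
    expected = (1.0 - p_catastrophe) * good_fraction + p_catastrophe * bad_fraction
    return expected, bad_fraction

def break_even_probability(insured=INSURED_FRACTION,
                           good_fraction=0.95, bad_fraction=0.30):
    """Catastrophe probability at which both choices have equal expectation."""
    return (good_fraction - insured) / (good_fraction - bad_fraction)

expected, worst = self_insured_outcome(0.05)
print(f"self-insured: expected {expected:.4f}, worst case {worst:.2f}")
print(f"insured:      {INSURED_FRACTION:.2f} either way")
print(f"break-even catastrophe probability: {break_even_probability():.3f}")
```

With these inputs self-insuring wins on expectation whenever the chance of
catastrophe is below the break-even value, but its worst case is far below
the insured outcome, which is the gamble described above.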

The second purpose of extended service is that hardware backed by an
onsite service contract and a company that assumes much of the risk is
more likely not to fail in the
first place.  That company has a strong incentive to protect >>their<<
risk in the venture by passing as much as possible back to the mfrs
(even at an additional cost) and to perform additional testing and
system engineering without a disincentive to uncover "bad" components
after (as Karen points out) it is more or less too late to do anything
other than sell them off as best you can and take your lumps.  The
company also typically has some actual clout with the manufacturers and
can dicker out deals that further minimize their (and your by proxy)
risk, both in terms of getting a premium selection of hardware and of
getting better warranty terms per dollar spent.

Deciding your optimum comfort level of risk taking is not easy -- partly
it is subjective, partly it can be made objective if you can assign a
dollar "value" to your time and the up time of your cluster.  Even
humans (with their relatively low failure rate during their "prime
years") tend to buy insurance during this period because even if failure
rates are low, the consequences to your family and loved ones of a
failure are very high.

   rgb

> 
> Sincerely,
> Josip

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu