[Beowulf] Re: Cooling vs HW replacement

Sun Jan 23 08:30:30 PST 2005

On Fri, 21 Jan 2005, Mark Hahn wrote:

> > Humans don't live a megahour MTBF.  Disks damn sure don't.
> 
> that's an attractive analogy, but I think it misses the fact that 
> a disk is mostly in a steady-state.  yes, there's a modest process
> of wear, and even some more exotic things like electromigration.
> but humans, by contrast, are always teetering on the edge of failure.
> I'm tipping back in my chair right now, courting a broken neck.
> I'm about to go out for my 4pm latte, which requires crossing a street.
> none of my disks are doing foolish and risky things like this - 
> most of them are just sitting there, some not even spinning, most 
> occasionally stirring themselves to glide a head across the disk.
> I at least, think of a seek as about as stressful as taking a breath
> (which is not to deny that my breaths and a disk's seeks are both,
> eventually, going to come to an end...)
> 
> one of my clusters has 96 nodes, each with a commodity disk in it.
> 10^6/(24*365.2425) = 114.07945862452115147242 years for each disk,
> and 1.18832769400542866117 years for the whole cluster.  since the 
> cluster has good cooling, and the disks not much used, I only expect
> about one failure every 1.2 years.
> 
> we're about to buy a cluster with 1536 nodes; assuming the new machineroom
> being built for it works out, we should expect about 1 failure per month.

Let's examine your point, and mine, seriously.  I was quite serious
about sacrificing chickens to 20 year old disks.  There aren't any (even
in environments where people try to run them this long, which are
admittedly quite rare).  I ran a 10 MB IBM disk (one of the best money
could buy -- IBM built arguably the best/most reliable disks in the
world at the time) in my original IBM PC for close to a decade before it
died, but die it did.  I've run a handful (six or seven) of disks out to
maybe eight years out of maybe a hundred that I've tried to keep in
service that long.  Less than 10%.  Those disks do endure a certain
amount of wear at a fairly predictable, fairly steady rate, and, like
the Deacon's One-Hoss Shay, at some point whether or not they break
down, they wear out.

So what I was really addressing is that the "rate of failure" measured
at some point in the disk's lifetime is a lousy predictor of estimated
lifetime.  In fact, it is NOT a predictor of a disk's expected lifetime
in any sense of the term derived from lifetime statistics.  One has to
know the distribution of failures, not the rate of failure at some
point, to determine the mean lifetime.  It's just calculus -- the rate
is the slope of the function we are interested in evaluated at some
specific point (across some specific delta, given that it is a rate).
At best this yields a linear approximation of the function in a Taylor
series -- probably one that optimistically omits the number of disks
that die "immediately" (the constant term at the beginning of said
Taylor series) at that.
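
To make the rate-versus-distribution point concrete, here is a minimal
Python sketch.  It assumes a made-up bathtub-shaped failure rate (the
function hazard_per_hour and every constant in it are hypothetical, not
measured data) whose flat middle section matches a quoted 1 Mhour MTBF,
and then integrates the resulting survival curve to get the mean
lifetime.

```python
import math

HOURS_PER_YEAR = 24 * 365.2425

# Failure rate on the flat part of the curve: the quoted "1 Mhour MTBF".
PLATEAU_RATE_PER_HOUR = 1.0e-6


def hazard_per_hour(age_years):
    """Hypothetical bathtub-shaped failure rate, in failures per disk-hour.

    Every constant here is invented for illustration, not fitted to data.
    """
    infant = 5.0e-6 * math.exp(-age_years / 0.1)          # early deaths, fading within months
    wearout = 1.0e-6 * math.exp((age_years - 3.0) / 1.5)  # wear-out, growing with age
    return PLATEAU_RATE_PER_HOUR + infant + wearout


def mean_lifetime_years(hazard, max_age_years=40.0, steps=40000):
    """Mean lifetime = integral of the survival curve S(t) over all ages.

    S(t) = exp(-cumulative hazard up to t); the hazard is sampled at the
    midpoint of each small age step.
    """
    dt = max_age_years / steps
    cumulative = 0.0
    mean = 0.0
    for i in range(steps):
        age = (i + 0.5) * dt
        half_step = hazard(age) * HOURS_PER_YEAR * dt / 2.0
        cumulative += half_step
        mean += math.exp(-cumulative) * dt
        cumulative += half_step
    return mean


plateau_mtbf_years = 1.0 / (PLATEAU_RATE_PER_HOUR * HOURS_PER_YEAR)
print(f"MTBF implied by the plateau rate:   {plateau_mtbf_years:.0f} years")
print(f"Mean lifetime from the whole curve: {mean_lifetime_years(hazard_per_hour):.1f} years")
```

With these invented numbers the plateau alone suggests about 114 years,
while the mean lifetime from the whole curve comes out well under a
decade.  The point is the gap between the two, not the particular
values.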

So let's think again about humans.  In the USA, humans have a mean
lifetime in the ballpark of 70+ years, a nice human-long time because
humans are actually amazingly stable, self-repairing dynamical entities.
Damn few mechanical constructs in nature retain individual, functional
form for this long, including constructs engineered by humans using the
best of current technology.

However, this datum alone is not that useful to e.g. insurance
actuaries.  Instead they look at the distribution.  Humans are
initially quite likely to die before they are born -- lots of eggs fail
to implant, lots of pregnancies terminate in miscarriage.  Humans are
relatively likely to die in their first two years, when their immune
(self-repair) system is weak and any defects in their manufacture
process are exposed to a hostile world.  Then they enter a stretch where
the probability of failure is quite low overall, with modest peaks
around the teens followed by a long period where it is very low indeed
(death pretty much only by accident -- "failures" are sufficiently rare
to be considered tragedies and not at all to be expected) until the
internal cellular repair mechanisms themselves start to age and actual
failures start to occur more often than accidents, around age
40-something.  There is then a gradual ramping up of failure rate until
(eventually) nobody gets out alive and damn few humans live to see their
100th year.  One human in a hundred million might live to 114 years (a
Megahour).

Note that even this statistical picture isn't detailed enough to be
useful to actuaries.  If you use drugs in your veins, are in a
military platoon serving in a little village in the most hostile part of
Iraq, are poor, are rich, have access to good health care or not -- all
of these change your risk of failure.  A human that gets just the right
amount of exercise, has the best medical care, doesn't smoke, drinks
just the right amount, follows a Mediterranean diet, has good genes, and
avoids risks can expect (on average) to outlive one that does the
opposite of all of the above by a good long time (pass me them french
fries to go with my beer:-).

Now during their safest years, if you examine the number of humans that
fail per year over some nice short baseline you might find that they,
too, have a good deal more than "1 million hours MTBF", especially if
you specifically exclude accidental death (which is the most likely
cause of death after you are perhaps 2 until you are in your 40's).
This is comforting -- it is why I don't expect to see ANY failures of
the humans in my physics classes, in spite of the fact that there are
hundreds of them, where I absolutely expect to see failures among the
hundreds of disks in my cluster.  In fact, if we saw as many failures
among the humans of my acquaintance (who tend to be in the sweet spot of
their expected lifetimes) as we all do in disks, we'd be screaming our
heads off about epidemics, war, and mayhem and would live trembling in
fear.

The human race would be at risk of not living long enough to perpetuate
itself -- how many disks make it to 13 years (age of puberty)?  Enough
so that two disks could cuddle together and produce a mess of little
floppy disklets to replace the hundreds of disks that died well before
then?  I doubt it, unless they produce litters of them with a short
gestational period...

So I reiterate -- MTBF for hard disks, as reported by the manufacturer,
is a nearly useless number.  What matters isn't a rate determined under
controlled conditions during a particularly favorable period in a disk's
lifetime, one that more or less excludes birth defects, accidental
death, and the tremendous variability of load and environmental
conditions (where a machine room that has a transient failure of AC can
be thought of as being sent to Iraq in the aforementioned infantry
platoon).  This is especially true when it is perfectly obvious that the
MTBF of disks averaged over their ENTIRE lifetime is NOT 1 Mhour, which
would imply either that roughly 1/2 the disks make it to 114 years still
operating or that the distribution is highly skewed so that some disks
last for a millennium or ten while the rest die young.
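
For reference, here is what a 1 Mhour MTBF would imply if it really
held over a disk's whole life.  The sketch assumes a constant failure
rate (an exponential lifetime distribution), which is an assumption
made here for illustration and not something the manufacturers state;
surviving_fraction is a hypothetical helper.

```python
import math

HOURS_PER_YEAR = 24 * 365.2425
MTBF_HOURS = 1.0e6  # the manufacturer-style figure under discussion


def surviving_fraction(age_years, mtbf_hours=MTBF_HOURS):
    """Fraction of disks still running at a given age, assuming a
    constant failure rate of 1/mtbf_hours over the entire lifetime."""
    return math.exp(-age_years * HOURS_PER_YEAR / mtbf_hours)


for age in (1, 3, 5, 10, 114):
    print(f"{age:4d} years: {surviving_fraction(age):6.1%} still running")
```

Under that assumption about 96% of disks would still be running at five
years, and about 37% at 114 years (for a constant rate it is a fraction
1/e of the population, not one half, that outlives the mean).  Neither
figure looks anything like what one sees in a real machine room.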

However, manufacturers (for obvious reasons) do not present us with a
graph of actual observed failure from all causes (which we could use to
do a true risk assessment).  They present us with an obviously globally
false number that is almost unbearably optimistic and cheery.  Almost
makes me want to be a disk...

I personally think that the more useful statistic is the true actuarial
one implicit in the following observation.  It used to be that nearly
all hard disks on the planet had one of two warranties.  "Server" class
SCSI disks (this is descriptive, not judgemental or intended to provoke
a flame:-) carried five year warranties, presumably because
manufacturers subjected them to a more rigorous in-house quality
assessment before selling them, effectively removing more of the ones
with birth defects from the population before sending them forth.
"Consumer" class IDE disks carried three year warranties, because they
sold them with less testing and hence there were more DOA's and first
six week failures.

A year or two ago, consumer disk warranties were dropped to a year by
nearly all the disk manufacturers.  If you wanted a three year disk you
had to pay a premium price for it and select a "special edition" disk.

Now I personally think that what has happened is obvious.  Disk is one
of the only components in a computer that carried a 3 year warranty or
better, and it gets harder and harder to engineer a high
quality/reliable disk as density etc keeps ramping up.  Everything gets
smaller, there are more points of failure, the net data load goes up,
the average heat generated goes up (not linearly, but up).  Even though
"MTBF" by their optimistic assessment methodology remains high, the
actual probability of failure from all causes is embarrassingly high.

Now in >>my<< opinion what this really means is that the probability of
current consumer disks getting to 3 years of actual lifetime under load
has gone down to the point where they simply cannot make money on the
margin they charge per disk if they have to replace all the failures.
If you look at the marginal difference in cost of the "special edition"
versions (perhaps 10% of retail), compare it to the cost of a warranty
replacement to the manufacturer (perhaps 50% of retail) you can guess
that they are anticipating that something in the ballpark of one disk in five fails
between the end of year one and the end of year three.  Some unknown
number will also fail in year one -- perhaps enough to bring the total
three year failure rate to 1/4.  That's a believable number to me, based
on my personal anecdotal experience.
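
The warranty arithmetic above can be written out as a small Python
sketch.  The 10% premium and 50% replacement cost are the guesses from
the text; the 5% year-one figure is an assumption picked only because it
reproduces the 1/4 total, and the function name is hypothetical.

```python
def break_even_failure_fraction(premium_fraction, replacement_cost_fraction):
    """Failure fraction at which the extra warranty premium just pays for
    the replacements, with both costs given as fractions of retail price."""
    return premium_fraction / replacement_cost_fraction


# Guesses from the text: premium ~10% of retail, replacement ~50% of retail.
years_two_and_three = break_even_failure_fraction(0.10, 0.50)

# Assumed year-one failure fraction (not from any data).
year_one = 0.05

three_year_total = year_one + years_two_and_three
print(f"Failing in years two and three: {years_two_and_three:.0%}")
print(f"Failing within three years:     {three_year_total:.0%}")
```

This treats both fractions as shares of all disks sold and ignores
shipping, handling, and customers who never bother to claim, so it is a
ballpark and nothing more.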

In my direct experience with consumer disks, I see roughly 50% failures
within five to six years. I've already experienced one admittedly
anecdotal disk failure out of three put into my household RAID within
its first year of operation, and that IS a special edition 3 year
warranty disk -- it is sitting boxed downstairs ready to ship back.
I've also experienced two failures over three years (out of three disks)
in this RAID's predecessor and something like five disk failures (out of
ten disks over five years) in the household's various workstations.
Anecdotal sure, but I'll bet they are not atypical.  These workstation
disks are nearly idle -- they do some work when the system spins up,
then just sit.  My household RAID isn't exactly hammered by its five
whole users, either, even with me as one of the users.  The disks are
exposed to power failures, cosmic rays, etc.  So it goes.

However, I do experience relatively few failures of disks (that aren't
DOA during burn-in) for the first six months or even year of operation
-- the recent RAID disk (failed at about eight months) is an exception,
not the rule.  Maybe it would even average out to 1 million hours MTBF
(over the first three months of post burn-in operation) who knows?

> another new facility will be 200TB of nearline storage.  if we did it 
> with 1.4e6 hr, 147GB SCSI disks, I'd expect to go 1022 hrs between failures.
> I'd prefer to use 500 GB SATA disks, even if they're 1e6 hrs, since that
> will let me go 2500 hours between failures (not to mention saving around 
> 5KW of power!)

And >>I'd<< expect that you can go 1022 hours between failures in the
first three months of operation, maybe 900 hours between failures in the
second three months of operation, maybe 800 hours between failures in
the third three months of operation, and downhill from there... Or some
other curve -- I don't know what the decay curve is, and I doubt that
the manufacturers will tell you (or that it would be real-world accurate
if they did tell you).  At a guess it is somewhat s-shaped with an
initial spike, a relatively flat period and then a more rapid
exponential starting in a year or two.  We're only seeing "MTBF rates"
reported as the most optimistic slope of that initial flat period.
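
As a rough check on these numbers, here is a minimal Python sketch of
the fleet arithmetic.  It assumes 200 TB means 200,000 GB with no RAID
or spare overhead, so the disk counts are assumptions, and the helper
names are hypothetical.  The per-quarter intervals are the guesses from
the paragraph above, not measurements.

```python
import math

HOURS_PER_YEAR = 24 * 365.2425


def disks_needed(capacity_gb, disk_gb):
    """Whole disks needed for a raw capacity, ignoring RAID and spares."""
    return math.ceil(capacity_gb / disk_gb)


def hours_between_failures(mtbf_hours, n_disks):
    """Expected time between failures across a fleet, assuming independent
    disks that each fail at a constant rate of 1/mtbf_hours."""
    return mtbf_hours / n_disks


CAPACITY_GB = 200_000  # assumed: 200 TB in decimal units

scsi = hours_between_failures(1.4e6, disks_needed(CAPACITY_GB, 147))
sata = hours_between_failures(1.0e6, disks_needed(CAPACITY_GB, 500))
print(f"147 GB SCSI: {scsi:.0f} hours between failures")
print(f"500 GB SATA: {sata:.0f} hours between failures")

# The 96-disk cluster from the quoted text, at 1 Mhour per disk.
cluster_years = hours_between_failures(1.0e6, 96) / HOURS_PER_YEAR
print(f"96 disks: {cluster_years:.2f} years between failures, "
      f"{1.0 / cluster_years:.2f} failures per year")

# Guessed intervals for the first three quarters versus a flat 1022 hours.
guessed_intervals_hours = [1022, 900, 800]
hours_per_quarter = HOURS_PER_YEAR / 4
decaying = sum(hours_per_quarter / h for h in guessed_intervals_hours)
flat = 3 * hours_per_quarter / 1022
print(f"Failures in nine months: {decaying:.1f} decaying vs {flat:.1f} flat")
```

The SATA case comes out at 2500 hours as quoted.  The SCSI case lands
near 1029 hours rather than 1022, presumably because of a slightly
different disk count.  Even the mild decay guessed above adds roughly
one extra failure over the first nine months, about 7.3 instead of 6.4.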

The one kind of operation that COULD tell you very accurately indeed
what the curve looks like would be somebody like Dell that offers
standard three (or more) year onsite service on entire systems,
including disks, for a fee.  Disk insurance salesmen, in other words.
Their databases would let you determine the curve quite precisely, at
least for their choice of hardware manufacturer(s).  In fact, their
databases are doubtless accurate enough that they can very deliberately
choose the best manufacturers in the specific sense that they cost the
least (for a given storage size) integrated over all warranty and
service obligations -- they MUST be accurate enough that they recover
costs and make a profit, or they'll fire their actuarial database folks
and start over.

    rgb

> 
> regards, mark hahn.
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu