[Beowulf] Re: Cooling vs HW replacement

David Mathog mathog at mendel.bio.caltech.edu
Fri Jan 21 11:09:41 PST 2005


> 
> > or "Server" grade disks still cost a lot more than that.  For
> 
> this is a very traditional, glass-house outlook.  it's the same one
> that justifies a "server" at $50K being qualitatively different 
> from a commodity 1U dual at $5K.  there's no question that there 
> are differences - the only question is whether the price justifies
> those differences.

The MTBF rates quoted by the manufacturers are one indicator
of disk reliability, but from a practical point of view the number
of years of warranty coverage on the disk is a more useful metric.

The manufacturer has an incentive to be sure that those disks
with a 5 year warranty really will last 5 years.  It's unclear
to me what incentive they have to stand behind the MTBF figures,
since only a sustained and careful testing regimen over many, many
disks could challenge them.  And who would run such an analysis???
Buy the 5 year disk and you'll have a working disk, or a
replacement for it, for 5 years.
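
For what it's worth, a quoted MTBF does imply an annualized failure
rate (under the usual constant-failure-rate assumption), and doing
that arithmetic shows why nobody outside the manufacturer is in a
position to check the claim.  A back-of-the-envelope sketch, with
the MTBF figures below purely illustrative:

    # Annualized failure rate (AFR) implied by a quoted MTBF, assuming
    # a constant failure rate (exponential lifetime model).
    import math

    HOURS_PER_YEAR = 8760.0

    def afr(mtbf_hours):
        """Fraction of a large fleet expected to fail per year."""
        return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

    print("1.2M hr MTBF -> %.2f%% per year" % (100 * afr(1.2e6)))  # ~0.73%
    print("1.0M hr MTBF -> %.2f%% per year" % (100 * afr(1.0e6)))  # ~0.87%

Telling 0.7%/year from 0.9%/year with any confidence would take
thousands of disk-years of controlled testing, which is exactly why
the warranty is the more useful number.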

In some uses it would clearly be cheaper to
use (S)ATA disks and replace them as they fail, so long as
they don't fail 4x faster than the Cheetahs.  Google around
for "disk reliability" though and you'll find some real horror
stories about disk failure rates in, for instance,
SCSI -> ATA RAID arrays.
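
Whether the cheaper disks actually win is just expected-cost
arithmetic, and the answer depends heavily on what a failure costs
beyond the bare drive.  A rough sketch of that trade-off, with every
price and failure rate below invented purely for illustration:

    # Illustrative only: compare total expected cost of cheap (S)ATA
    # disks replaced as they fail vs. pricier SCSI/FC disks.
    def expected_cost(price, afr, years, swap_overhead=0.0):
        """Purchase price plus expected replacement cost; swap_overhead
        is whatever a failure costs beyond the drive itself (labor,
        rebuild window, downtime)."""
        return price + afr * years * (price + swap_overhead)

    YEARS = 5
    for overhead in (0.0, 5000.0):
        scsi = expected_cost(500.0, afr=0.01, years=YEARS,
                             swap_overhead=overhead)
        sata = expected_cost(100.0, afr=0.04, years=YEARS,  # "4x" worse
                             swap_overhead=overhead)
        print("overhead $%5.0f:  SCSI/FC $%4.0f   SATA $%4.0f"
              % (overhead, scsi, sata))

On drive cost alone even the 4x-worse (S)ATA disk comes out ahead;
the picture only flips once the per-failure overhead dominates the
price of the disk itself, which is where the horror stories come in.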



> 
> the real question is whether "server" disks make sense in your
> application.
> what are the advantages?
> 
> 	1. longer warranty - 5yrs vs typical 3yrs for commodity disks.
> 	this rule is currently being broken by Seagate.  the main caveat
> 	is whether you will want that disk (and/or server) in 3-5 years.

Generally yes, we do want that disk to still be working at 5 years.
Cannot predict whether or not the hardware will have been replaced
before then.

> 
> 	2. higher reliability - typically 1.2-1.4M hours, and usually 
> 	specified under higher load.  this is a very fuzzy area, since 
> 	commodity disks often quote 1Mhr under "lower" load.

Exactly.  It's very, very hard to figure out just how much reliability
one is trading for the lower price.  Anecdotally, for heavy disk
usage, it's apparently a lot.  Anecdotally, for low disk usage,
ATA disks aren't all that reliable either.

> 
> 	3. very narrow recording band, higher RPM, lower track density.
> 	these are all features that optimize for low and relatively
> 	consistent seek performance.  in fact, the highest RPM disks actually
> 	*don't* have the highest sustained bandwidth - "consumer" disks are 
> 	lower RPM, but have higher recording density and bandwidth.

Right.  On the other hand, anecdotal evidence suggests that an
application like a busy Oracle database running on top of
RAID-over-ATA storage will see a very high rate of disk failure,
whereas the equivalent RAID on SCSI/FC Cheetahs will not.  Again,
that's from Google results, not personal experience.  Well, not much
personal experience: we do have a 4-disk FC RAID in one Sun
server and have not lost a disk yet (coming up on 2 years).

My personal experience with ATA disks in servers has been limited.
A smallish Solaris server configured with "cutting edge,
large capacity" ATA disks went through an IBM disk and then its
Western Digital replacement, each failing within a month.  Backing
way off on the capacity and going to older 40 GB IBM ATA disks did
the trick, with no further disk failures in 3 years.

> 
> 	4. SCSI or FC.  always has been and apparently always will be 
> 	significantly more expensive infrastructure than PATA was
>       or SATA is.

Agreed.  I'd be perfectly happy to buy SATA or PATA disks _IF_ they
were as reliable as the more expensive SCSI or FC disks.  It would
help a lot to have some objective measure of that.  When Seagate
starts selling 5 year SATA disks I'll consider buying them.

> 
> so really, you have to work to imagine the application that
> perfectly suits a "server" disk.  for instance, you can
> obtain whatever level of reliability
> you want from raid, rather than ultra-premium-spec disks.

In theory.  In practice, local experience (at another lab) was that
a RAID-over-ATA solution failed, twice, and was unable to rebuild
from what was left; all data was lost.  Maybe that was the
controller, or just a really bad set of disks.  I wasn't there
to witness the teeth-gnashing and finger-pointing.  This wasn't a
tier one storage vendor (Sun, EMC, HP, etc.), so they saved some
money.  Or did they???

There's also a school of thought that RAID arrays should be
"disk scrubbed" frequently (all blocks on all disks read)
to force hardware failures and block remapping to surface early
enough that the redundant information in the array can still
rebuild what's lost.  That's as opposed to the worst case, where
the data is written once, not touched for a year, and then proves
unrecoverable because a read hits bad blocks on more than one
disk at once.
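
A scrub doesn't require anything fancy: the point is only that every
sector gets read on a schedule, so weak sectors are found while the
array still has the redundancy to repair them.  A minimal sketch of
the idea (device names are placeholders, and a RAID layer's own
check/verify facility is preferable where one exists):

    # Minimal "disk scrub" sketch: read every block of each member disk
    # so marginal sectors are detected (and remapped by the drive, or
    # repaired from redundancy by the RAID layer) before a rebuild
    # depends on them.  Device paths below are placeholders.
    import sys

    BLOCK = 1024 * 1024                 # read in 1 MB chunks

    def scrub(device):
        """Read every block of the device, counting read errors."""
        errors = 0
        with open(device, 'rb', buffering=0) as dev:
            while True:
                try:
                    chunk = dev.read(BLOCK)
                except (IOError, OSError):
                    errors += 1
                    dev.seek(BLOCK, 1)  # skip past the unreadable region
                    continue
                if not chunk:           # end of device
                    break
        return errors

    if __name__ == '__main__':
        for dev in sys.argv[1:]:        # e.g. /dev/sda /dev/sdb ...
            print("%s: %d read errors" % (dev, scrub(dev)))

Run from cron against the array members (or via the RAID layer's
built-in check), that catches the write-once, read-never failure
mode before it matters.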

>  is your data 
> access pattern really one which requires a disk optimized for seeks?

On the Beowulf, not so much.  Most of the workload has
been configured so that the compute nodes have their data cached
in memory and only read the disks hard when booting up and the first
time they read their databases.

On the Sun Oracle server, much more so.

> 
> under what circumstances will you have a 100% duty cycle?  

Probably never?  But where between 100% and 0% is the cutover
point, where the cost of the increased disk failure rate just
equals the savings from using cheaper disks?
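
One way to frame that cutover: the cheaper disks keep paying off as
long as the extra expected failures per year cost less than the
price premium spread over the service life.  A toy version of that
arithmetic, with every number below invented:

    # Toy break-even: how much higher can the cheap disk's annual
    # failure rate be before the savings are gone?  Figures invented.
    def breakeven_afr_delta(price_premium, years, cost_per_failure):
        """Extra failures per disk-year the cheap disk can tolerate
        before the premium disk wins on total cost."""
        return price_premium / (years * cost_per_failure)

    # $400 premium, 5 year horizon, $2000 all-in cost per failure
    # (drive + labor + rebuild window + downtime):
    print("%.3f extra failures/disk/year"
          % breakeven_afr_delta(400.0, 5, 2000.0))   # 0.040, i.e. 4%/yr

On a compute node, where a dead disk mostly costs a swap and a
reinstall, that margin is hard to use up; on a heavily loaded head
node, where every failure also costs everyone's time, the same
arithmetic can easily go the other way.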

> 
> in summary: there is a place for super-premium disks, but it's just plain
> silly to assume that if you have a server, it therefore needs SCSI/FC.
> you need to look at your workload, and design the disk system based on 
> that, using raid for sure, and probably most of your space on 5-10x 
> cheaper SATA-based storage.

I'd be a lot more comfortable buying the cheaper disks if there were
some objective measure that accurately predicted their actual
longevity.  I tend to look at it from the other direction.  A disk
failure on the head node is a much bigger deal than a disk failure
on the compute nodes.  Also, the number of disks involved is likely
to be smaller for the former than the latter.  That is, one might
have 10 disks in a RAID on the head node but 70 ATA disks out on the
compute nodes.  So it might cost a couple of thousand more to use
the most reliable disks available on the head node, but it's most
likely worth it to avoid having to replace those critical components.
Conversely, losing the odd compute node usually isn't critical,
so there's not as much reason to pay for more expensive disks there.


Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


