[Beowulf] Re: Cooling vs HW replacement

Mark Hahn hahn at physics.mcmaster.ca
Fri Jan 21 12:04:24 PST 2005


> an analysis???  Buy the 5 year disk and you'll have a working
> disk, or a replacement for it, for 5 years.

and my real point was that everyone should ask whether they really
want/need that before "paying for quality".

one reason not to is that 3 or 5 years from now, disks will be much
better.  when the improvement curve is steep, it's not to your advantage to
"invest" in a more long-lived product.  remember, the cost difference
is not just a few percent, but at least 4x.  so you get roughly 4x as much
storage per dollar, even if each disk only lasts 60% as long.  bank some of
the savings and you can replace all your disks every ~2 years or so (*and*
pick up the improvements in disks along the way.)  or just use raid, and get the same or higher
reliability, more space, and probably higher performance.
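
to put rough numbers on that (prices and sizes below are purely illustrative
guesses, not quotes - a little python just to show the arithmetic):

    # cost per GB-year, with made-up but plausible numbers
    premium_cost, premium_gb, premium_years = 500.0, 73.0, 5.0         # e.g. a SCSI drive
    commodity_cost, commodity_gb, commodity_years = 120.0, 250.0, 3.0  # e.g. a SATA drive

    def cost_per_gb_year(cost, gb, years):
        return cost / (gb * years)

    print(cost_per_gb_year(premium_cost, premium_gb, premium_years))        # ~1.37
    print(cost_per_gb_year(commodity_cost, commodity_gb, commodity_years))  # ~0.16

even granting the cheap disk a shorter life, the cost per delivered GB-year
isn't close.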

> they don't fail 4x faster than the Cheetahs.  Google around
> for "disk reliability" though and you'll find some real horror
> stories about disk failure rates in, for instance,
> SCSI -> ATA RAID arrays.

unfortunately, anecdotal evidence is nearly useless outside of sociology ;)

> Generally yes, we do want that disk to still be working at 5 years.

interesting.  I don't mind if things survive past 3 years, but I don't
generally plan to use them, at least not for their original purpose.

there's just too much to be gained by upgrading after 3 years.

> My personal experience with ATA disks in servers has been limited.

hah!  this *is* the only really reliable part of anecdotal evidence - 
the demographic of respondents ;)

> help a lot to have some objective measure of that.  When Seagate
> starts selling 5 year SATA disks I'll consider buying them.

http://info.seagate.com/mk/get/AMER_WARRANTY_0704_JUMP

> There's also a school of thought that RAID arrays should be
> "disk scrubbed" frequently (all blocks on all disks read)
> to force hardware failures and block remapping to occur early
> enough so that the redundant information present
> in the array can rebuild from what's left.  As opposed to a worst
> case where the data is written once, not touched for a year,
> and then fails unrecoverably when a read hits multiple bad blocks.

it's all about what failure modes you're expecting, with what probability.
if you want scrubbing, it means you're expecting some sort of silent
media degradation.  that's not unreasonable, and it might even be sane
to expect it more of commodity disks than of premium ones.
(IMO mainly because commodity densities are so much higher, and premium
disks are clearly designed to trade away density for higher robustness.)

but maybe you should scrub premium disks as well, since if you really
haven't touched some part of the disk, you don't actually have any data
on that particular disk's reliability.  it would be quite interesting
if one could obtain some sort of quality measure from the disk while
in use.  I notice that ATA supports a read-long command, for instance,
which claims to give you the raw block *and* its associated ecc.
SMART also provides some numbers (as well as self-tests) that might be 
useful here.
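
for the curious, the manual version of a scrub plus a SMART check is trivial
to script.  a minimal sketch, assuming linux, smartmontools installed, and
/dev/sda as the victim (the device name is just for illustration):

    # minimal scrub-and-check sketch
    import subprocess

    dev = "/dev/sda"   # illustrative; pick your own disk

    # read every sector once; latent read errors surface here
    # (and show up below as pending/reallocated sectors)
    subprocess.run(["dd", "if=" + dev, "of=/dev/null", "bs=1M"], check=True)

    # dump the SMART attribute table (reallocated sectors, pending sectors, etc.)
    subprocess.run(["smartctl", "-A", dev], check=True)

    # or let the drive test itself in the background:
    # subprocess.run(["smartctl", "-t", "long", dev], check=True)

(md and most hardware raid controllers will of course do the equivalent
scrub for you.)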

but silently crumbling media is not the only failure mode!  I'm not even
sure it's a common one.  I see more temperature and vibration-related 
troubles, for instance.

> On the Sun Oracle server, much more so.

out of curiosity, is the machine otherwise well-configured?  for instance,
does it have a sane amount of ram (1GB/cpu is surely minimal these days,
and for a DB, lots more is often a good idea.)  or is the DB actually 
quite small, but incredibly update-heavy?

> > under what circumstances will you have a 100% duty cycle?  
> 
> Probably never?  But where in between 100% and 0% is the cutover
> point where increased disk failure rate costs just equal the
> savings from using cheaper disks?  

complicated, for sure.  but since premium disks are ~5x more expensive
and certainly not 5x more reliable, it's probably worth pondering...
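
the back-of-the-envelope version, with every number made up (plug in your
own failure rates and your own cost per dead disk):

    # crude break-even sketch: does the premium price buy back its cost
    # in avoided failures?  all inputs are hypothetical.
    n_disks          = 100
    commodity_price  = 120.0
    premium_price    = 600.0    # ~5x
    commodity_afr    = 0.05     # annual failure rate, a guess
    premium_afr      = 0.02     # a guess
    cost_per_failure = 300.0    # admin time, downtime, swap hassle

    extra_hw = n_disks * (premium_price - commodity_price)
    saved    = n_disks * (commodity_afr - premium_afr) * cost_per_failure  # per year

    print("extra hw: %.0f, failure cost saved/yr: %.0f" % (extra_hw, saved))
    # -> extra hw: 48000, failure cost saved/yr: 900

with guesses anywhere near those, the premium never pays for itself on
compute nodes; it only starts to when cost_per_failure gets very large,
ie when the disk sits in a genuine choke-point.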

> the most reliable disks available on the head node, but it's most
> likely worth it to avoid having to replace those critical components.
> Conversely, the number of compute nodes isn't usually critical
> so there's not as much reason to pay for more expensive disks there.

except that raid of commodity disks can trivially match the reliability
of un-raided premium disks.
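
quick sanity check on that claim, with illustrative failure rates and a
deliberately crude independence model (it ignores correlated failures,
controller trouble, and the rebuild window):

    # chance of losing the data in a year, modelled (crudely) as
    # independent whole-disk failures.  AFR numbers are illustrative guesses.
    commodity_afr = 0.05
    premium_afr   = 0.02

    single_premium   = premium_afr          # lone premium disk dies: 2%
    mirror_commodity = commodity_afr ** 2   # both halves of a RAID-1 die in the
                                            # same year: 0.25% (really they'd have
                                            # to overlap the rebuild window, so
                                            # even this overstates it)

    print("%.4f vs %.4f" % (single_premium, mirror_commodity))   # 0.0200 vs 0.0025

and the mirror gives you the commodity capacity and read bandwidth on top.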

overall, I don't criticise people who use high-quality disks in 
critical and heavily-loaded choke-points.  indeed, I do it myself.
what bothers me is the often unquestioned assumption that using premium
disks is always better.  it's not for nodes.  it's not for big storage.
it's probably not even for user-level filesystems (/home and the like).
but for a non-replicated fileserver that provides PXE/kernel/rootfs
for 1000 diskless nodes, well, duh!

the real point is that raid and server replication make it easy 
to design around critical-and-overloaded hotspots.

regards, mark hahn.



