[Beowulf] real hard drive failures

Donald Kinghorn kinghorn at pqs-chem.com
Tue Jan 25 08:16:33 PST 2005

I'm only partially interested in the thread "Cooling vs HW replacement" but 
the problem with drive failures is a real pain for me. So, I thought I'd 
share some of my experience.

I do clusters for computational chemistry, and every node has two drives 
RAID-striped for scratch, since some comp chem procedures require huge amounts 
of scratch space. Our older systems were typical rack mounts, but over the last 
year and a half we have used a custom chassis with better cooling ... 

We have used mostly Western Digital (WD) drives for > 4 years. We use the 
higher rpm and larger cache varieties ...

We also used IBM 60GB drives for a while, and some of you will have experienced 
that mess ... approx. 80% failure over a 1 year time frame!

Observations on WD drive failures: (estimates)

WD 20, 40, 60 GB drives in the field for 3+ years, [~600 drives]  very few 
(<1%) failures; most machines have been retired.

WD 80GB drives in the field for 1+ years, [~500 drives] "ARRRRGGGG!" ~15% 
failure and increasing. I send out 3-5 replacement drives every month. 

WD 120 and 200GB SATA in the field <1 year, [~400 drives] one failure so far.
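
As a rough sanity check on the 80GB numbers, here is a minimal Python sketch 
of the arithmetic. It assumes the 3-5 replacements per month all come from the 
~500-drive 80GB population and that the rate stays steady; both are 
assumptions, and the function name is made up for illustration.

```python
# Rough arithmetic on the estimates quoted above. All inputs are the
# post's own approximate figures, not measured data.

def annualized_rate(replacements_per_month: float, population: int) -> float:
    """Fraction of the population replaced per year at a steady monthly rate."""
    return 12 * replacements_per_month / population

WD80_POPULATION = 500  # "~500 drives" (estimate from the list above)

# Assumption: the 3-5 monthly replacements are all WD 80GB drives.
low = annualized_rate(3, WD80_POPULATION)
high = annualized_rate(5, WD80_POPULATION)

print(f"WD 80GB replacement rate: {low:.1%} to {high:.1%} per year")
```

At that pace the current replacements alone work out to roughly 7% to 12% of 
the 80GB population per year, which fits with "~15% and increasing" for 
drives that have been in the field for more than a year.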

I'm moving to a 3 drive raid5 setup on each node (drives are cheap, down time 
is not) and am considering changing to Seagate SATA drives. Anyone care to 
offer opinions or more anecdotes?  :-)

Best wishes to all

Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC

