[Beowulf] Consumer vs. Enterprise Hard Drives in Clusters

Bill Broadley bill at cse.ucdavis.edu
Fri Jan 23 14:13:02 PST 2009


Jon Forrest wrote:
> Wouldn't the effect of vibrations from multiple drives depend
> greatly on the mechanical properties of the bay enclosure and
> the chassis itself? For example, I have a 16 bay enclosure that's
> built like a tank (I know because I dropped it once). I would
> think that the vibration of one drive would barely be noticed
> by others. Of course this question could be answered by measuring,
> assuming the presence of the right instrumentation, which I don't have.

I've seen little correlation between weight and vibration; after all, even the
built-like-a-tank hardware is still noisy.

>> * Consumer drives (at least the non-media ones) often have occasional
>>   thermal recalibrations.  This seems better these days, but the last thing
>>   you want is a recal triggering a degraded array.
> 
> What does the RAID controller and OS see when such a thing happens?

Just a delay between the read/write request and the answer.  Usually there is
a timeout; after all, a completely dead drive might never answer.
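
To make that concrete, here is a rough sketch (in Python; /dev/sdX and the
100 ms threshold are just placeholders, not recommendations) of what the host
actually observes: a read that takes too long to come back, whether the cause
is a recal, a retry storm, or a dying drive.

# Rough sketch: time individual reads from a block device and flag stalls.
# /dev/sdX and the 0.1 second threshold are placeholders.
import os, time

DEV = "/dev/sdX"
CHUNK = 1024 * 1024          # read 1 MB at a time
THRESHOLD = 0.1              # flag anything slower than 100 ms

fd = os.open(DEV, os.O_RDONLY)
try:
    for i in range(1000):
        t0 = time.time()
        if not os.read(fd, CHUNK):
            break            # end of device
        dt = time.time() - t0
        if dt > THRESHOLD:
            # a recal, a retry storm, or a bad sector all look the same
            # here: the request simply takes too long to come back
            print("read %d stalled for %.3f seconds" % (i, dt))
finally:
    os.close(fd)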

>> * Consumer drives will go to heroic efforts to read a bad sector, exactly
>>   the opposite of what you want in a RAID drive.  In a RAID it's better to
>>   fail and yell bloody murder... especially when rereading a sector a bunch
>>   of times causes the RAID to time out and drop the disk.
> 
> But wouldn't failing and yelling bloody murder be treated by
> the RAID controller the same as when a drive times out? 

Correct.

> In either
> case, I would expect the RAID controller to see the drive as having
> failed. Then, when you replace the drive you'd cause a RAID unit
> rebuild which is a very dangerous thing to do these days given how
> large drives are and the chances of an I/O error occurring during
> the rebuild.

Well, you don't want the drive hiding the fact that it had to retry 10 times
to read a sector.  Sure, smartctl can track this kind of thing, but strangely,
hardware RAID controllers often hide that info from the operating system.
Basically, for a RAID you want a "yes, here is the block" or a "no, I don't
have it" within a fairly low time window.  That matters especially in the
gruesome case of a manual rebuild, where you don't want the marginal sectors
sending your drive into la la land and preventing you from getting the
perfectly healthy blocks off.

It all comes down to this: it's easier to deal with a "sorry, can't get that
block within 50ms" than to handle a drive that disappears for tens of seconds
at a time.
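
That "answer quickly or give up" behavior is what the TLER/ERC knob on the
RAID edition drives buys you.  As a rough sketch, assuming a smartmontools
recent enough to support SCT Error Recovery Control and a drive that actually
implements it (/dev/sdX is a placeholder), you can query and cap the drive's
internal retry time from the host:

# Rough sketch: query and cap a drive's SCT Error Recovery Control timer
# with smartctl.  Assumes a recent smartmontools and a drive that supports
# SCT ERC; /dev/sdX is a placeholder.
import subprocess

DEV = "/dev/sdX"

# Show the current read/write recovery limits (reported in tenths of a second).
subprocess.call(["smartctl", "-l", "scterc", DEV])

# Cap both read and write recovery at 7.0 seconds (70 x 100 ms) so a marginal
# sector fails fast instead of sending the drive into la la land.
subprocess.call(["smartctl", "-l", "scterc,70,70", DEV])

# The errors the drive does admit to show up in the SMART attributes
# (reallocated sectors, pending sectors, and so on).
subprocess.call(["smartctl", "-A", DEV])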

The kind of nightmare scenario I've seen: bit rot starts in a 16-disk array,
the array looks perfect, but of course the number of invisible retries keeps
increasing.  If you are using a pathetically old kernel (like, say, the
standard RHEL kernel) you don't have ECC scrubbing.  Then of course a drive
drops, you go to rebuild, and a second drive hits an error (one that had been
silent until now).  Then you are in a position where you want to scan all
drives and hope that the errors you find are not aligned with the errors on
the other drives.  With RAID edition drives you can do such a rebuild in a
reasonable amount of time; with desktop drives, even one that is 99% good
blocks can lead to very high rebuild times.
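
Assuming the scrubbing in question is the Linux md variety, newer kernels
expose it through sysfs, and kicking off a periodic scrub is cheap insurance
against exactly this "second drive hits a silent error during the rebuild"
case.  A minimal sketch, assuming an md array named md0 and root privileges:

# Rough sketch: kick off a Linux md background scrub ("check") and report
# the mismatch count.  Assumes an md array named md0 and root privileges.
SYS = "/sys/block/md0/md"

# Writing "check" asks md to read every block on every member and verify
# parity/mirrors, surfacing silent errors before a rebuild forces the issue.
with open(SYS + "/sync_action", "w") as f:
    f.write("check\n")

# Progress shows up in /proc/mdstat; once the scrub is idle, mismatch_cnt
# reports how many inconsistencies were found.
with open(SYS + "/mismatch_cnt") as f:
    print("mismatches so far: " + f.read().strip())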

I'm guessing that when a 120MB/sec consumer drive is delivering 20-30MB/sec
its service life is shortened, but I've no numbers to back that up.  Under the
same conditions a RAID edition drive delivered 75MB/sec or so, with or without
vibration.
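
If you want to see what fraction of peak your own drives deliver, a plain
sequential streaming read gives a rough number.  A quick sketch (the device
path, read size, and duration are placeholders; run it against a device that
isn't already sitting in the page cache, or the number will be inflated):

# Rough sketch: measure sequential read throughput from a block device so
# delivered MB/sec can be compared against the drive's rated peak.
# /dev/sdX, the 1 MB read size, and the 10 second duration are placeholders.
import os, time

DEV = "/dev/sdX"
CHUNK = 1024 * 1024
DURATION = 10.0

fd = os.open(DEV, os.O_RDONLY)
total = 0
start = time.time()
try:
    while time.time() - start < DURATION:
        buf = os.read(fd, CHUNK)
        if not buf:
            break                      # hit the end of the device
        total += len(buf)
finally:
    os.close(fd)

elapsed = time.time() - start
print("%.1f MB/sec" % (total / elapsed / 1e6))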

Manufacturers are starting to mention the number of drives in a RAID a given
model is rated for... they seem to be differentiating between single drives,
2-4 drive arrays, and larger.

>> Of course manufacturers claim various things about error rates per billion
>> bits, designed duty cycles (40 hours a week vs 24/7), improved temperature
>> envelopes, and related.  Alas, while this is nice to hear, I've not seen any
>> direct results because of it.
> 
> I too agree it would be nice but neither you nor Google appear to be
> seeing it (assuming, as one poster said, that we're all using the
> same definition of "enterprise drive").
> 
> What's surprising to me is that if this were true then I'd expect the
> manufacturer's warranty to be different for the two classes of drives.
> Maybe this is the reason that Seagate is changing to a 3-year warranty
> for their consumer drives (I haven't seen anything about the warranty
> for the enterprise drives).

Hrm, well, I don't see why a consumer drive shouldn't last 5 years under
consumer use, or an enterprise drive 5 years under enterprise use.

>> As an example: 500GB WD Caviar, $64.99; 500GB WD RE3, $89.99.  IMO if you
>> are building a RAID or a heavily used 1U with a ton of fans, the extra $25
>> is worth it.
> 
> If there are real differences between the drives then this would be an
> easier decision. There doesn't appear to be a consensus, however, that
> the differences in the field are significant.

I've personally seen a large difference in performance delivered (as a percent
of peak) due to vibration.



