[Beowulf] RE: real hard drive failures

David Mathog mathog at mendel.bio.caltech.edu
Wed Jan 26 08:25:23 PST 2005


> > >     - raid will NOT prevent your downtime, as that raid box
> > >     will have to be shut down sooner or later
> > >     (shutting down sooner (asap) prevents data loss)
> >
> > huh?  hotspares+hotplug=zero downtime.
> 
> you're assuming that "hot plug" works as it's supposed to
>     - i usually get the phone calls after the raid didn't
>     do its magic for some odd reason

My impression, based solely on web research and not personal
experience, is that RAID arrays that fail to rebuild are often
suffering from "latent bad block" syndrome.  That is, a block on
disk 1 has gone bad but hasn't been read yet, so the bad block is
"latent".  Then disk 2 fails.  The RAID tries to rebuild, now has
to read the bad block on disk 1, gets a read error, and that's
pretty much all she wrote for that chunk of data.
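
For a sense of scale, here's a back-of-the-envelope estimate (my
numbers, not anyone's measurements): consumer drives are typically
specced at an unrecoverable read error rate of roughly 1 in 10^14
bits, so merely reading all of a 200 GB disk during a rebuild has a
real chance of tripping over at least one latent bad block:

# Rough odds that a full read of one disk hits an unrecoverable
# error.  Both figures below are assumptions (typical spec-sheet
# numbers), not measurements.
p_ure = 1e-14                  # unrecoverable errors per bit read
disk_bits = 200e9 * 8          # one 200 GB disk, in bits
p_hit = 1 - (1 - p_ure) ** disk_bits
print("P(>=1 read error) = %.1f%%" % (100 * p_hit))   # ~1.6%

And a rebuild has to read every surviving disk in the array, so the
exposure multiplies with the number of disks.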
The "fix" is to disk scrub, forcing reads of every block
on every disk periodically, and so converting the "latent"
bad blocks into "known" bad blocks at a time when the RAID still
has sufficient information to rebuild a lost disk block.
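
On Linux software RAID (md), recent kernels let you trigger exactly
this kind of scrub through sysfs.  A minimal sketch, assuming an md
array and root privileges (the device name is just an example); run
it weekly from cron:

#!/usr/bin/env python
# Start a background verify ("scrub") of a Linux md array by writing
# "check" to its sysfs control file.  The kernel then reads and
# verifies every block, surfacing latent bad blocks while the array
# still has redundancy.
import sys

def start_scrub(md_dev="md0"):
    path = "/sys/block/%s/md/sync_action" % md_dev
    with open(path, "w") as f:
        f.write("check\n")

if __name__ == "__main__":
    start_scrub(sys.argv[1] if len(sys.argv) > 1 else "md0")

Hardware RAID controllers usually have an equivalent "patrol read"
or media-verify setting buried somewhere in their management tools.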
Also use SMART to keep track of disks that have started to remap
blocks, and replace those disks before they fail outright.
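
A minimal sketch of that monitoring, assuming smartmontools is
installed: SMART attribute 5 (Reallocated_Sector_Ct) is the count of
blocks the drive has already remapped, and the alert threshold below
is an illustrative guess, not a vendor figure:

#!/usr/bin/env python
# Flag drives whose SMART reallocated-sector count has crossed a
# threshold.  Assumes smartmontools; "smartctl -A" prints the SMART
# attribute table, where attribute ID 5 = Reallocated_Sector_Ct.
import subprocess

REALLOC_THRESHOLD = 20   # hypothetical alert level; tune per site

def reallocated_sectors(device):
    out = subprocess.check_output(["smartctl", "-A", device]).decode()
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == "5":   # attribute ID in column 1
            return int(fields[-1])        # raw count is the last column
    return None                           # attribute not reported

for dev in ("/dev/sda", "/dev/sdb"):
    count = reallocated_sectors(dev)
    if count is not None and count > REALLOC_THRESHOLD:
        print("%s: %d reallocated sectors, consider replacing" % (dev, count))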
Deciding where to set that threshold (how many remapped blocks is
too many on a given drive?) seems like a fairly complex decision in
a storage environment involving hundreds or thousands of disks.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


