[Beowulf] Surviving a double disk failure

Chris Samuel csamuel at vpac.org
Sun Apr 19 01:40:52 PDT 2009


----- "Joe Landman" <landman at scalableinformatics.com> wrote:

> 2) Scrub early, scrub often.

As long as you don't have IBM gear, where what appears to
be a firmware issue somewhere (possibly on the disks
themselves) can cause the LSI RAID controller they rebadge
to decide that up to 12 drives have just failed in the
space of a few minutes.

Of course none of them really have failed, but your RAID60
is still toast and boy does it take a few years off your life,
not to mention days and days to recover from tape..

Sigh..

This happens under Debian (with a mainline kernel) and under
CentOS with its stock kernel (we copied over the scrub script
that Debian packages), but of course IBM wouldn't take any
notice until we could reproduce it under RHEL. You can
trigger a scrub manually with, for example:

echo check > /sys/block/md0/md/sync_action
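
For anyone wanting the same on a non-Debian box, here's a
minimal sketch of the sort of periodic scrub that script does
(just a sketch - the real Debian script, checkarray IIRC, is
rather more careful, and the cron schedule is up to you):

#!/bin/sh
# Kick off a consistency check on every md array that is
# currently idle; run from cron (e.g. monthly).
for f in /sys/block/md*/md/sync_action; do
    [ -e "$f" ] || continue           # no md arrays present
    if [ "$(cat "$f")" = "idle" ]; then
        echo check > "$f"
    fi
done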

We now have another vendor's storage unit and won't think
about using the IBM unit in anger until we can confirm that
the latest round of firmware updates has solved the problem.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


