[Beowulf] Surviving a double disk failure

Orion Poplawski orion at cora.nwra.com
Fri Apr 10 10:16:20 PDT 2009

Bill Broadley wrote:
> Guy Coates wrote:
>> Yikes, epic recovery.
>>> What are the lessons learnt?
>> You forgot the obvious one.
> I suggest ditching silly old centos/redhat kernels and run something new
> enough to allow for scrubbing.  So that all your disks don't silently start
> collecting errors waiting to cascade into a lost RAID upon the first
> non-silent error.

As a stop-gap solution here I periodically run "smartctl -t long
/dev/<blah>" on all the disks to check their status.  A daily cron job
tests one disk a day on my 26-disk servers, so each disk is checked
about once a month.
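
For anyone wanting to try the same rotation, here is a minimal sketch of
what such a daily job could look like. It is not the actual cron script
used here: the device names /dev/sda through /dev/sdz, the function
names, and the "--run" switch are all assumptions made for illustration.
Only the "smartctl -t long <device>" invocation comes from the
description above. The script prints the command by default and starts
the self-test only when given "--run".

```python
import datetime
import string
import subprocess
import sys


def default_disks():
    """Hypothetical device list: 26 disks named /dev/sda ... /dev/sdz."""
    return ["/dev/sd" + letter for letter in string.ascii_lowercase]


def disk_for_day(disks, day):
    """Pick one disk per calendar day, cycling through the whole list."""
    return disks[day.toordinal() % len(disks)]


def build_command(device):
    """Command line that starts a long SMART self-test on one device."""
    return ["smartctl", "-t", "long", device]


def main(argv):
    command = build_command(disk_for_day(default_disks(), datetime.date.today()))
    if "--run" in argv:
        # Assumption: run as root with smartmontools installed.
        subprocess.run(command, check=True)
    else:
        # Dry run by default: only show what would be executed.
        print(" ".join(command))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With 26 disks the rotation wraps every 26 days, which gives the roughly
monthly check per disk. A crontab line along the lines of
"0 2 * * * /usr/local/sbin/smart_rotate.py --run" (path hypothetical)
would start one test each night. The test itself runs inside the drive,
so the outcome has to be read afterwards, for example with
"smartctl -l selftest /dev/<blah>".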

Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion at cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com
