[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Chris Samuel csamuel at vpac.org
Mon Apr 6 18:34:29 PDT 2009


----- "Joe Landman" <landman at scalableinformatics.com> wrote:

> Chris Samuel wrote:
>
> > for i in md[0123]; do
> >    echo check > /sys/block/$i/md/sync_action
> > done
> 
> Are these softirq cpu hangs?

Nope, these are SCSI read errors coming back from the drives.

I've now been asked to update the IBM driver (they don't
support the RHEL one) and the firmware on the disks. New
versions of both were released in the last few days, with
changelogs that look vaguely applicable.

> could you tell me what
> 
>    cat /sys/block/md[0123]/md/stripe_cache_size
> 
> reports?

They're 256 on vanilla RHEL 5.3, the same as on CentOS
(2.6.18-92.1.10.el5PAE) and Debian (2.6.28.9).
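
For anyone wanting to compare across boxes, here is a rough sketch that
prints the stripe cache size and current sync state for each md array in
one go. It assumes a POSIX shell and the usual md sysfs layout; the
SYSFS_ROOT variable and the report_md function name are just my
illustration, not anything the kernel or mdadm provides.

```shell
#!/bin/sh
# Hypothetical override so the function can be pointed at a copy of /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print stripe_cache_size and sync_action for every md array that has them.
report_md() {
    for dev in "$SYSFS_ROOT"/block/md*; do
        md="$dev/md"
        # stripe_cache_size is only present on RAID4/5/6 arrays; skip the rest.
        [ -r "$md/stripe_cache_size" ] || continue
        size=$(cat "$md/stripe_cache_size")
        # sync_action reads back e.g. "idle" or "check"; fall back if unreadable.
        state=$(cat "$md/sync_action" 2>/dev/null || echo unknown)
        printf '%s: stripe_cache_size=%s sync_action=%s\n' \
            "${dev##*/}" "$size" "$state"
    done
}
```

Source the file and call report_md with no arguments to read from /sys.
It only reads, so it is safe to run while a check is in progress.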

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
