[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Chris Samuel csamuel at vpac.org
Mon Apr 6 05:46:29 PDT 2009


----- "Rahul Nabar" <rpnabar at gmail.com> wrote:

> I contacted Dell. Responses range from the clueless to the absurd. First,
> they convinced us it was Fedora. So I shifted to CentOS. They still
> claim CentOS is "unvalidated" but I refuse to spend a fortune to move
> over to RHEL like they want me to.

Not that this helps, but you have my sympathy as I've
been dealing with the same stuff from IBM over a storage
server they sold us.

Turns out I can make 7-12 drives in their external
enclosures fail in short order (seconds to minutes
between failures) by telling the software RAID to
do a check, thus:

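# Run as root. The arrays are listed explicitly because the md[0123] glob
# only expands in a directory that contains matching names (e.g. /sys/block).
for i in md0 md1 md2 md3; do
   echo check > "/sys/block/$i/md/sync_action"
done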
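
For anyone wanting to watch the arrays after kicking off the check, here
is a minimal sketch, not what we actually ran. It reads the md sysfs
attributes sync_action and degraded for each array named on the command
line. The function name md_status and the SYSFS_ROOT variable are
hypothetical, added only so the function can be pointed at a scratch
directory instead of the real /sys/block.

```shell
# Hypothetical helper: print the current sync action and degraded count
# for each md array given as an argument.
# SYSFS_ROOT is an assumption for illustration; it defaults to /sys/block.
md_status() {
    root="${SYSFS_ROOT:-/sys/block}"
    for dev in "$@"; do
        dir="$root/$dev/md"
        if [ -r "$dir/sync_action" ] && [ -r "$dir/degraded" ]; then
            # sync_action: idle, check, resync, recover, ...
            # degraded: number of devices missing from the array
            printf '%s: sync_action=%s degraded=%s\n' \
                "$dev" "$(cat "$dir/sync_action")" "$(cat "$dir/degraded")"
        else
            printf '%s: no md sysfs directory\n' "$dev"
        fi
    done
}
```

Something like "md_status md0 md1 md2 md3" run every few seconds should
show the degraded count climbing as drives drop out, assuming the kernel
exposes both attributes for the arrays in question.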

Even though we could reproduce it on 64-bit Debian
and 32-bit CentOS they wouldn't escalate the issue
until we could reproduce it on RHEL5 - which we did
today.

Sigh..

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
