[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Joe Landman landman at scalableinformatics.com
Mon Apr 6 06:11:50 PDT 2009


Chris Samuel wrote:
> ----- "Rahul Nabar" <rpnabar at gmail.com> wrote:
> 
>> I contacted Dell. Responses range from the clueless to the absurd. First,
>> they convinced us it was Fedora. So I shifted to CentOS. They still
>> claim CentOS is "unvalidated" but I refuse to spend a fortune to move
>> over to RHEL like they want me to.
> 
> Not that this helps, but you have my sympathy as I've
> been dealing with the same stuff from IBM over a storage
> server they sold us.
> 
> Turns out I can make 7-12 drives in their external
> enclosures fail in short order (seconds to minutes
> between failures) by telling the software RAID to
> do a check, thus:
> 
> for i in /sys/block/md[0123]; do
>    echo check > "$i/md/sync_action"
> done

Are these softirq CPU hangs?

Could you tell me what

   cat /sys/block/md[0123]/md/stripe_cache_size

reports?
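
If it is easier, here is a rough sketch that labels each value with its array name, since a bare cat prints the numbers without saying which array they belong to. The function name show_stripe_cache and its optional root argument are my own invention for illustration; stripe_cache_size itself only exists for RAID 4/5/6 arrays, so any other array is silently skipped.

```shell
# Hypothetical helper: print "mdN: <stripe_cache_size>" for md0..md3.
# The optional first argument overrides the sysfs root (default /sys/block).
show_stripe_cache() {
    root="${1:-/sys/block}"
    for f in "$root"/md[0123]/md/stripe_cache_size; do
        # Skip arrays without the attribute (non-RAID4/5/6) and
        # the literal pattern left behind when nothing matches.
        [ -r "$f" ] || continue
        name=${f#"$root"/}   # strip the root prefix
        name=${name%%/*}     # keep only the mdN component
        printf '%s: %s\n' "$name" "$(cat "$f")"
    done
}

show_stripe_cache
```

The output is one line per array, in the form "md0: <value>".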

> 
> Even though we could reproduce it on 64-bit Debian
> and 32-bit CentOS they wouldn't escalate the issue
> until we could reproduce it on RHEL5 - which we did
> today.
> 
> Sigh..
> 


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


