[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
Joe Landman
landman at scalableinformatics.com
Mon Apr 6 06:11:50 PDT 2009
Chris Samuel wrote:
> ----- "Rahul Nabar" <rpnabar at gmail.com> wrote:
>
>> I contacted Dell. Responses range from the clueless to the absurd. First,
>> they convinced us it was Fedora. So I shifted to CentOS. They still
>> claim CentOS is "unvalidated" but I refuse to spend a fortune to move
>> over to RHEL like they want me to.
>
> Not that this helps, but you have my sympathy as I've
> been dealing with the same stuff from IBM over a storage
> server they sold us.
>
> Turns out I can make 7-12 drives in their external
> enclosures fail in short order (seconds to minutes
> between failures) by telling the software RAID to
> do a check, thus:
>
> for md in /sys/block/md[0123]; do
>     echo check > "$md/md/sync_action"
> done

Are these softirq CPU hangs? Could you tell me what

    cat /sys/block/md[0123]/md/stripe_cache_size

reports?
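
If it is easier to send everything in one paste, here is a minimal sketch that prints the stripe cache size and the current sync state for each array. It assumes the arrays are md0 through md3, as in the loop quoted above. The function name report_md and its optional sysfs-root argument are my own invention, there only so the loop can be tried against a scratch directory. stripe_cache_size exists only for RAID4/5/6 arrays, so other levels show "n/a".

```shell
#!/bin/sh
# report_md [SYSFS_ROOT]
# Print stripe_cache_size and sync_action for md0..md3.
# SYSFS_ROOT is a hypothetical override (default: /sys) for dry runs.
report_md() {
    root="${1:-/sys}"
    for md in "$root"/block/md[0123]; do
        # If nothing matches, the glob stays literal; skip it.
        [ -d "$md/md" ] || continue
        name=${md##*/}
        # stripe_cache_size is absent on non-RAID4/5/6 arrays.
        cache=$(cat "$md/md/stripe_cache_size" 2>/dev/null || echo "n/a")
        action=$(cat "$md/md/sync_action" 2>/dev/null || echo "n/a")
        echo "$name stripe_cache_size=$cache sync_action=$action"
    done
}

# Usage: report_md            (reads /sys)
#        report_md /tmp/fake  (reads /tmp/fake/block/md*/md/...)
```

Running it once before starting the check and once while it is in progress would show whether the hang coincides with the "check" state.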
>
> Even though we could reproduce it on 64-bit Debian
> and 32-bit CentOS they wouldn't escalate the issue
> until we could reproduce it on RHEL5 - which we did
> today.
>
> Sigh..
>
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615