[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Tue Apr 7 17:28:53 PDT 2009


On Tue, Apr 7, 2009 at 5:12 PM, Matt Lawrence <matt at technoronin.com> wrote:
> See if there is a standalone diagnostics CD for these systems.  If you can
> get the error to occur with it, let Dell fix it from there.

Tried the CD but cannot get the error to occur in a reasonable time
(ran it for about 3 days; baybe I ought to run it longer by taking a
entire node out of production). There are two issues (1) even during
production the error has a statistical mean-time-between-crashes of
around 1 to 2 weeks.  (2) I am not really sure the diagnostic CD tests
all the relevant calls and CPU functions.

I wish there was a "I do X and the error occurs" That would be simple.
This is the class of non-repeatable one-off errors that is hard to
demonstrate.

-- 
Rahul




More information about the Beowulf mailing list