[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Mon Apr 6 16:18:05 PDT 2009


On Mon, Apr 6, 2009 at 2:37 PM, John Bushnell
<john.bushnell at icb.ucsb.edu> wrote:
> I never call support until after I have diagnosed a problem myself as much
> as possible.  One of the advantages of buying a batch of nodes at once is
> that you can easily swap components between nodes to isolate the real
> problem.  You will find Dell support easier to deal with (or any other
> vendor for that matter) if you can concisely tell them all of the steps that
> you took to determine that component X needs replacement.  Yes, I have had a
> bad vendor give me the run around, but the better information that you can
> put into an initial service call, the better service you will tend to
> receive.  If you just say "my node stopped working", they will assume that
> you don't know what you're doing.

What puzzles me is this:

Someone had to write the code that produces the error that my
baseboard controller logs:

"Critical error; Voltage sensor (VCORE) critical error. State asserted
CPU2" etc.

In all my naivete I'd expect it to be a branch responding to some
error condition. Why is it so hard for the vendor to at least
single out which chip or component that error was designed to flag?
One could argue "many conditions can result in this specific error",
but then again, what's the point of a trap so generic?
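
To make concrete what I am picturing, here is a rough Python sketch.
It is purely a guess on my part: I have not seen the BMC firmware, and
the class, the function, and the voltage limits below are all invented
for illustration. Only the message text is taken from my log.

```python
# Hypothetical sketch only; NOT the actual BMC firmware.
# VoltageSensor, check() and the voltage limits are invented for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoltageSensor:
    name: str          # sensor label, e.g. "VCORE"
    entity: str        # what the sensor is attached to, e.g. "CPU2"
    lower_crit: float  # assumed lower critical threshold, in volts
    upper_crit: float  # assumed upper critical threshold, in volts


def check(sensor: VoltageSensor, reading: float) -> Optional[str]:
    """Return a log line if the reading is outside the critical window."""
    # This is the "branch responding to some error condition".
    if reading < sensor.lower_crit or reading > sensor.upper_crit:
        return (f"Critical error; Voltage sensor ({sensor.name}) "
                f"critical error. State asserted {sensor.entity}")
    return None  # reading is within limits, nothing to log


vcore_cpu2 = VoltageSensor("VCORE", "CPU2", lower_crit=0.9, upper_crit=1.5)
print(check(vcore_cpu2, 1.2))  # inside the assumed window
print(check(vcore_cpu2, 0.4))  # outside the assumed window
```

If the real code looks anything like this, the branch knows exactly
which sensor tripped, which is why I'd expect the vendor to be able to
say what that sensor is wired to.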

I wish I could pore over the source myself just for kicks. Or if only
I could somehow get access to the guy who coded the firmware on that
BMC!

-- 
Rahul



