[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
lindahl at pbm.com
Mon Apr 6 01:08:36 PDT 2009
On Mon, Apr 06, 2009 at 02:54:23AM -0500, Rahul Nabar wrote:
> Eventually they send me swaps for the Motherboard and CPU. No
> go. Still hangs at random.
Unfortunately there is no magic bullet. I have seen bad batches of
power supplies cause a problem like this. In another case, an
integrator-supplied Linux kernel built with some unfortunate debugging
options turned on was causing all the hangs.
From your symptoms, the power supply seems to be the next thing to
suspect. From your switch of distros, it's probably not a particular
bad Linux kernel. You have a few completely new machines that don't
hang; move the known good power supplies to other nodes with suspect
mobos and cpus.
> I haven't really pored at all the legalese in our contracts but is
> there a "lemon-law" analog for computers? If 20% of the machines are
> bad in the first one year do you think I can press for a better
> resolution from Dell?
Your university's boilerplate T's & C's probably have some text that
says something like "the stuff you sell us has to work, even if the
way it fails isn't something explicitly discussed in the contract."
But, after an entire year, it will be hard to do anything. You lost
leverage when you paid Dell. It's more likely that Dell will convince
your University purchasing people that you are an idiot than the
> Just wanting to hear more about how I can best resolve this issue. For
> our future purchases would changing vendors help?
Not really. I don't think there's any global trend among vendors; you
find people with horror stories all over. Have I ever told the story
of the mobo with the exploding caps? 1/1000 chance of blowing up each
time it was power cycled. Kinda obvious in a 1000 node cluster... how
it slipped through the mobo vendor's QA? ...
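
To put rough numbers on why a 1/1000 failure rate is obvious at that scale, here is a back-of-the-envelope sketch. It assumes each board fails independently and that the whole cluster is power cycled at once; neither assumption comes from the actual incident, and the variable names are just for illustration.

```python
# Assumptions (not from the original post): each board fails independently,
# and all 1000 nodes are power cycled together.
p_fail = 1 / 1000   # per-board chance of blowing a cap on one power cycle
nodes = 1000        # cluster size

# Expected number of boards lost in one full-cluster power cycle.
expected_failures = nodes * p_fail

# Chance that at least one board blows: 1 minus the chance that none do.
p_at_least_one = 1 - (1 - p_fail) ** nodes

print(f"expected failures per cluster power cycle: {expected_failures:.2f}")
print(f"chance of at least one failure: {p_at_least_one:.1%}")
```

Under those assumptions you expect about one dead board per full power cycle, and a bit better than even odds of seeing at least one. On a single test board, the same defect would almost never show up.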