[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Mon Apr 6 09:36:33 PDT 2009

On Mon, Apr 6, 2009 at 3:08 AM, Greg Lindahl <lindahl at pbm.com> wrote:

> From your symptoms, the power supply seems to be the next thing to
> suspect. From your switch of distros, it's probably not a particular
> bad Linux kernel. You have a few completely new machines that don't
> hang; move the known good power supplies to other nodes with suspect
> mobos and cpus.

Thanks Greg. That could be it. I might give the power-supply idea a shot.

> Your university's boilerplate T's & C's probably have some text that
> says something like "the stuff you sell us has to work, even if the
> way it fails isn't something explicitly discussed in the contract."
> But, after an entire year, it will be hard to do anything. You lost
> leverage when you paid Dell. It's more likely that Dell will convince
> your University purchasing people that you are an idiot than the
> reverse.

Well, Dell gets paid almost as soon as they deliver the machines each
time, so there's no leverage there anyway. Just curious: do any of you
have clauses wherein you pay Dell only after they have demonstrated
trouble-free operation for the first year or some such? We might want
to add a similar clause to our contracts in light of this experience.

> Not really. I don't think there's any global trend among vendors; you
> find people with horror stories all over. Have I ever told the story
> of the mobo with the exploding caps? 1/1000 chance of blowing up each
> time it was power cycled. Kinda obvious in a 1000 node cluster... how
> it slipped through the mobo vendor's QA ? ..

Yeah, we got screwed by a similar capacitor issue. The Optiplexes we
were using in our legacy home-brewed cluster (before we started buying
rack servers) were subject to a capacitor recall. It was a widely known
issue. Dell started replacing the motherboards on those machines that
crashed because of leaky capacitors. They assured us they'd keep
doing it on a machine-by-machine basis, and we were happy with that.

Unfortunately, somewhere along the way our warranty ended and the
sys-admin tracking the problem left. Now I find some newly dead machines
with the same problem, and they tell me that "the recall has ended
and you are out of warranty." No go.

Which is why I am so desperate to find out what our SC1435 problem
actually is and get Dell to do the swapping while we are still safely
under warranty. We got burnt once, and this time we are trying to be
smarter!

