[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
Douglas J. Trainor
trainor at transborder.net
Thu Apr 9 09:35:32 PDT 2009
I think Greg et al. are correct. Does your SC1435 have a Delta
Electronics switching power supply? I bet you have a 600 watt Delta.
Intel recently had problems with outsourced 350 watt "FHJ350WPS"
switching power supplies that apparently affected 5% of some server
lines. These were loading imbalance problems between the 3.3 volt and
12 volt lines. The affected power supplies had a minimum loading
requirement that was not met. The over-voltage protection circuit
would kick in on the 3.3V line. However, in these cases, the Intel
machines would not reboot. Intel is modifying the 3.3 volt minimum
loading from 1.2 amps to 0.2 amps to fix the problem.
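To make the minimum-loading point concrete, here is a rough sketch in Python. The 1.2 amp and 0.2 amp figures are the ones mentioned above; the measured draws and the helper name are made up for illustration and are not measurements from any SC1435 or Intel machine.

```python
# Hypothetical illustration: the helper name and the measured draws are assumptions.
def rails_below_minimum(measured_amps, minimum_amps):
    """Return the rails whose measured draw is below the supply's minimum load."""
    return sorted(
        rail for rail, minimum in minimum_amps.items()
        if measured_amps.get(rail, 0.0) < minimum
    )

# Assumed draw in amps on a lightly loaded board (not real data).
measured = {"3.3V": 0.5, "12V": 18.0}

old_spec = {"3.3V": 1.2}  # 3.3 volt minimum load before the change (from the post)
new_spec = {"3.3V": 0.2}  # 3.3 volt minimum load after the change (from the post)

print(rails_below_minimum(measured, old_spec))  # ['3.3V']
print(rails_below_minimum(measured, new_spec))  # []
```

With the assumed 0.5 amp draw, the 3.3 volt rail violates the old 1.2 amp minimum but satisfies the relaxed 0.2 amp one, which is the kind of case the change is meant to cover.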
On Apr 6, 2009, at 12:36 PM, Rahul Nabar wrote:
> On Mon, Apr 6, 2009 at 3:08 AM, Greg Lindahl <lindahl at pbm.com> wrote:
>> From your symptoms, the power supply seems to be the next thing to
>> suspect. From your switch of distros, it's probably not a particular
>> bad Linux kernel. You have a few completely new machines that don't
>> hang; move the known good power supplies to other nodes with suspect
>> mobos and cpus.
> Thanks Greg. That could be it. I might give the power-supply idea a
>> Your university's boilerplate T's & C's probably have some text that
>> says something like "the stuff you sell us has to work, even if the
>> way it fails isn't something explicitly discussed in the contract."
>> But, after an entire year, it will be hard to do anything. You lost
>> leverage when you paid Dell. It's more likely that Dell will convince
>> your University purchasing people that you are an idiot than the
> Well, Dell gets paid almost when they deliver the machines each time.
> So there's no leverage there anyways. Just curious: do any of you have
> clauses wherein you pay Dell after they have demonstrated trouble free
> ops for the first year or some such? We might want to add a similar
> clause to our contracts in the light of this experience.
>> Not really. I don't think there's any global trend among vendors; you
>> find people with horror stories all over. Have I ever told the story
>> of the mobo with the exploding caps? 1/1000 chance of blowing up each
>> time it was power cycled. Kinda obvious in a 1000 node cluster... how
>> it slipped through the mobo vendor's QA ? ..
> Yeah, we got screwed by a similar capacitor issue. The Optiplexes we
> were using in our legacy home-brewed cluster (before we started buying
> rack servers) had a capacitor-recall. It was a widely-known issue.
> Dell started providing us with motherboards on those machines that
> crashed because of leaky capacitors. But they convinced us they'd keep
> doing it on a machine-by-machine basis and we were happy.
> Unfortunately somewhere along the way our warranty ended and the
> sys-admin tracking the problem left. I find some newly dead machines
> with the same problem and then they tell me that "the recall has ended
> and you are out of warranty." No go.
> Which is why I am so desperate to find what our SC1435 problem
> actually is and get Dell to do the swapping while we are still safe
> under our warranty. We got burnt, and this time we are trying to be smarter!
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf