[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Tue Aug 11 15:31:08 PDT 2009


On Thu, Apr 9, 2009 at 11:35 AM, Douglas J. Trainor <trainor at transborder.net> wrote:
> Rahul,
>
> I think Greg et al. are correct.  Does your SC1435 have a Delta Electronics
> switching power supply?  I bet you have a 600 watt Delta.
>
> Intel recently had problems with outsourced 350 watt "FHJ350WPS" switching
> power supplies that apparently affected 5% of some server lines.  These were
> loading imbalance problems between the 3.3 volt and 12 volt lines.  The
> affected power supplies had a minimum loading requirement that was not met.
>  The over-voltage protection circuit would kick in on the 3.3V line.
>  However, in these cases, the Intel machines would not reboot.  Intel is
> modifying the 3.3 volt minimum loading from 1.2 amps to 0.2 amps to fix the
> problem.


A while ago I posted about the crashing SC1435s that I had. I
received lots of good suggestions on this list. Thanks all!

A lot of persistence with the vendor succeeded in getting their
engineering team to run long-duration tests on one of our captured
machines. It had to be tested for over a month before they finally
replicated the failure. Whew! (In the past they had aborted tests well
before that point.)

They won't give me many internal details, but apparently it is caused
by an "hardware issue more likely caused certain motherboards with
Opterons" [sic].

So, thanks again, and it does seem that we finally got down to the cause
of this irritating problem! Just posting this in case it helps any
other SC1435 admins in a similar boat!

Cheers!

-- 
Rahul



