[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Mon Apr 6 08:50:58 PDT 2009


On Mon, Apr 6, 2009 at 3:20 AM, John Hearns <hearnsj at googlemail.com> wrote:
> 2009/4/6 Rahul Nabar <rpnabar at gmail.com>:
>> We bought 23 Dell PowerEdge SC1435 servers for our latest cluster
>> addition in mid-2008. This batch of machines has proved to be a total
>> disaster from day one. I am looking for suggestions on how I should
>> tackle this.
>
> Sending an email to an international email list composed of people who
> make purchasing decisions
> for future machines costing many thousands of $$$$ is an excellent start!
>
> Seriously - I hope your Dell account manager sees this and starts kicking butt.

Thanks John! One of the biggest problems in a huge setup like Dell's is
the lack of a clear incentive to help me. The tech-services guy isn't
the one who's making money on the next order. Also, almost every time
you call you end up speaking to a new person, so you rarely find a rep
willing to take the courageous decision that it is indeed a bad
system and swap it (or, alternatively, to spend hours in a truly
out-of-the-box debug cycle to diagnose exactly which component is
bad).

Most merely prolong the cycle with mindless testing and by running
sosreports, DSETs and such. Some create even more activity by
asking me to change distros and by sending me a CPU here and a
power supply there in the hope that something will work!

-- 
Rahul


