[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Mon Apr 6 00:54:23 PDT 2009

We bought 23 Dell PowerEdge SC1435 servers for our latest cluster
addition in mid-2008. This batch of machines has proved to be a total
disaster from day one. I am looking for suggestions on how I should
tackle this. We are a fairly small university setup and I don't have
much experience dealing with these vendor issues.

History:

I put these machines into production in Aug '08. Within a month we had
the first machine go bad. The machines hang with an amber LED and the
logging module clearly logs an error of the sort: "Voltage sensor
(VCORE) critical error. State asserted CPU2". The machine then needs a
physical power-cycle from the back-plane to restart.

I contacted Dell. Responses ranged from the clueless to the absurd.
First, they convinced us it was Fedora, so I shifted to CentOS. They
still claim CentOS is "unvalidated", but I refuse to spend a fortune to
move over to RHEL like they want me to. I doubt this has anything to do
with our problem anyway. I discussed this problem extensively on the
Beowulf list back then and got many excellent suggestions, thanks!
http://www.beowulf.org/archive/2008-October/023547.html and
http://www.beowulf.org/archive/2008-October/023549.html

Then I went through the whole circus of running dset, ipmi, sosreport
and a bunch of stress-testing tools they sent me. It all takes a lot
of time. Eventually they sent me replacements for the motherboard and
CPU. No go. The machines still hang at random.

From Sept. 2008 till Jan 2009 I had a total of 5 servers go bad. 5
out of 23 is a failure rate of just over 20%. Finally they agreed to
swap a few servers in their entirety, and this solved the problem for
those specific machines. I suspect they have a bad batch of SC1435s,
but they say they do not have any other reports.

Now a new machine has gone down, and it's back to wasting my time
going through all those debugging procedures again.

Do others face similar vendor issues? If 6 out of 23 machines go bad
within 8 months of an order, can I expect the vendor to exchange the
rest too? Or do I have to wait for each machine to go down
individually? In spite of having paid for next-day service, each time
we have waited more than a month while Dell goes through the whole
debugging circus. The last straw was a Dell tech rep who chastised me
today, responding:

 "The next day (service) refers to normal break fix issues that
involve normal parts, since we may be replacing an entire server it
may take longer".

To quantify: "longer" usually means a month or more for us.

A single bad machine also causes larger problems, since it usually
disrupts jobs that span a bunch of nodes. I'm just a grad student
here, and without a dedicated sysadmin it takes a lot of time to run
all the testing that Dell demands. I am OK running basic tests, but if
machines go bad during the warranty period, is such testing within my
domain or the vendor's?

I haven't really pored over all the legalese in our contracts, but is
there a "lemon-law" analog for computers? If 20% of the machines are
bad in the first year, do you think I can press for a better
resolution from Dell?

I just want to hear more about how I can best resolve this issue. For
our future purchases, would changing vendors help? Is there any trend
in the quality of service from different vendors? I have only been
exposed to Dell and its frustrating customer service so far; are
HP, IBM or any others better, worse, or uncorrelated? Of course, I do
realize that ours is a small setup by today's standards (we have just
23 SC1435s), so I am not really one of Dell's high-revenue customers.

Or is this just the way things are, and I ought to resign myself to it
rather than fight it out?

-- 
Rahul