[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
john.bushnell at icb.ucsb.edu
Mon Apr 6 12:37:40 PDT 2009
Prentice Bisbal wrote:
> Mark Hahn wrote:
>> buying an extended warranty might help. buying a shrink-wrapped cluster
>> might help too.
> Not really. My cluster was a "shrink-wrapped" cluster from Dell. Turns
> out Dell hired someone from a 3rd-party to actually turn on the cluster
> (for the first time) and install all the software (nothing more than a
> vanilla ROCKS installed, without even a queuing system!) *after* the
> cluster arrived at our site.
> An arrangement like this just muddies the situation even further. If I
> had a software problem, do I call Dell, or the 3rd-party hired to
> install the software?
> I think you mean "buy a shrink-wrapped cluster from a well-respected,
> cluster-specific vendor that has proven in-house cluster expertise"
I never call support until after I have diagnosed a problem myself as
much as possible. One of the advantages of buying a batch of nodes at
once is that you can easily swap components between nodes to isolate the
real problem. You will find Dell support (or any other vendor's, for
that matter) easier to deal with if you can concisely tell them all of
the steps that you took to determine that component X needs replacement.
Yes, I have had a bad vendor give me the runaround, but the better the
information that you can put into an initial service call, the better
service you will tend to receive. If you just say "my node stopped
working", they will assume that you don't know what you're doing.
I once had a really bad batch of nodes that were overheating and
crashing within minutes of booting up, without any load whatsoever. The
vendor (who will never get my business again) kept telling me to reseat
the CPUs, put them in a colder room, etc. When I told the university
not to pay the bill, I suddenly had several people from the company very
interested in my problem. (BTW, I ended up fixing the problem MYSELF by
stuffing leftover foam packing material in the open spaces between the
front (cold) side and the back (hot) side of the servers. Later the
vendor admitted that it was a poorly designed system, but way too late
for me to ever consider them again. It was the worst several weeks of
Sorry about your bad luck - John