[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

John Bushnell john.bushnell at icb.ucsb.edu
Mon Apr 6 12:37:40 PDT 2009


Prentice Bisbal wrote:
> Mark Hahn wrote:
>> buying an extended warranty might help.  buying a shrink-wrapped cluster
>> might help too.
> Not really. My cluster was a "shrink-wrapped" cluster from Dell. Turns
> out Dell hired someone from a 3rd-party to actually turn on the cluster
> (for the first time) and install all the software (nothing more than a
> vanilla ROCKS install, without even a queuing system!) *after* the
> cluster arrived at our site.
> An arrangement like this just muddies the situation even further. If I
> had a software problem, do I call Dell, or the 3rd party hired to
> install the software?
> I think you mean "buy a shrink-wrapped cluster from a well-respected,
> cluster-specific vendor that has proven in-house cluster expertise"
I never call support until after I have diagnosed a problem myself as 
much as possible.  One of the advantages of buying a batch of nodes at 
once is that you can easily swap components between nodes to isolate the 
real problem.  You will find Dell support (or any other vendor, for
that matter) easier to deal with if you can concisely tell them all of
the steps that you took to determine that component X needs replacement.
Yes, I have had a bad vendor give me the runaround, but the better the
information you put into an initial service call, the better the
service you will tend to receive.  If you just say "my node stopped
working", they will assume that you don't know what you're doing.
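
The swap-and-isolate bookkeeping described above can be sketched roughly as follows, in Python. This is only an illustration: the names (SwapTest, verdict, support_summary) and the node and component labels are hypothetical, and nothing here reflects any vendor's tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SwapTest:
    """One swap of a suspect part between two nodes (all labels hypothetical)."""
    component: str                # e.g. "PSU" or "DIMM A1"
    moved_from: str               # node that was failing before the swap
    moved_to: str                 # known-good node that received the part
    failing_after: Optional[str]  # node that failed after the swap, or None


def verdict(test: SwapTest) -> str:
    """Say whether the fault followed the part or stayed with the node."""
    if test.failing_after == test.moved_to:
        return f"fault followed the {test.component}, so suspect the {test.component}"
    if test.failing_after == test.moved_from:
        return f"fault stayed on {test.moved_from}, so the {test.component} looks fine"
    return "result inconclusive, repeat the test"


def support_summary(tests: List[SwapTest]) -> str:
    """Build a numbered list of steps to read out on the service call."""
    lines = [
        f"{i}. Moved {t.component} from {t.moved_from} to {t.moved_to}: {verdict(t)}."
        for i, t in enumerate(tests, 1)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed example data, not taken from a real cluster.
    print(support_summary([
        SwapTest("DIMM A1", "node07", "node12", "node07"),
        SwapTest("PSU", "node07", "node12", "node12"),
    ]))
```

Keeping one record per swap makes it easy to show support which parts were ruled out and which one the failure followed.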

I once had a really bad batch of nodes that were overheating and 
crashing within minutes of booting up, without any load whatsoever.  The 
vendor (who will never get my business again) kept telling me to reseat 
the CPUs, put them in a colder room, etc.  When I told the university 
not to pay the bill, I suddenly had several people from the company very 
interested in my problem.  (BTW, I ended up fixing the problem MYSELF by 
stuffing leftover foam packing material in the open spaces between the 
front (cold) side and the back (hot) side of the servers.  Later the 
vendor admitted that it was a poorly designed system, but way too late 
for me to ever consider them again.  It was the worst several weeks of 
my life.)

    Sorry about your bad luck  -  John
