[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Mark Hahn hahn at mcmaster.ca
Mon Apr 6 15:50:51 PDT 2009

> An arrangement like this just muddies the situation even further. If I
> had a software problem, do I call cluster, or the 3rd-party hired to
> install the software?

if you had a problem that could be clearly attributed to whoever
sold you the package, you should call them.  they sold it, and if 
what they sold included support, they're on the hook.  of course, 
they might claim you broke it, but that's always an option for a vendor
wanting to avoid support.

for us, we currently run HP's RHEL version (XC) on our HP clusters;
it includes Platform LSF.  when we have problems, we open a ticket
with HP - whether they use their own in-house expertise or punt
to Platform is up to them.  same as for hardware, really.

> I think you mean "buy a shrink-wrapped cluster from a well-respected,
> cluster-specific vendor that has proven in-house cluster expertise"

no, I don't - what's required is a paid-for support contract
and a vendor who takes it seriously.  if I bought a cluster with 
support, I'd go straight to the legal dept if I thought the vendor
wasn't living up to the contract...

all vendors are, by nature and necessity, interested in avoiding 
support costs.  there are always hoops to jump through - mainly,
I think, to filter out the randoms.  that is, enough friction to 
cause you to think about reading the docs first, but also enough
layers of support to keep their heaviest tech from explaining vi ;)

regards, mark hahn.
