[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Mark Hahn hahn at mcmaster.ca
Mon Apr 6 10:37:11 PDT 2009

> I put these machines into production in Aug '08. Within a month we had
> the first machine go bad. They hang with an amber LED and the

what's the term of the warranty?

> logging-module clearly logs an error of the sort: "Voltage sensor
> (VCORE) critical error. State asserted CPU2". Machine needs a
> power-cycle physically from back-plane to restart

well, I think it's worth asking whether you're sure your power feed
is in good shape.

> Do others face similar vendor issues? If 6 out of 23 machines go bad
> within 8 months of an order can I expect the vendor to exchange the
> rest too?

IMO, no.  not without some indication that the fault is reproducible
and that it is actually theirs...

> And a single bad machine causes larger problems since it usually
> results in disrupting jobs that run spanning across a bunch of nodes
> too.

well, if you bought it as a cluster, not just some nodes,
then you might have a case that the cluster is not working.
the problem with poor replicability is that it permits finger-pointing.

> Just wanting to hear more about how I can best resolve this issue. For
> our future purchases would changing vendors help? Is there any trend

buying an extended warranty might help.  buying a shrink-wrapped cluster
might help too.

> behind the quality of services from different vendors? I have only
> been exposed to Dell and its frustrating customer-service so far; are
> HP / IBM or any others better or worse or uncorrelated? Of course, I

my organization has been an HP shop, more or less, since inception in 2001,
for reasons I won't go into.  I believe they've done well by us - I could 
criticize prices, some hardware design issues, etc, but they're quite 
responsible and responsive to problems.

