[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Mon Apr 6 12:35:10 PDT 2009


On Mon, Apr 6, 2009 at 1:32 PM, Frank Gruellich
<frank.gruellich at navteq.com> wrote:
> IMHO SC1435 are some kind of low-cost metal from DELL.  I would not use
> them if I want a reliable system.  Especially in HPC where one failed
> system ruins your whole (maybe long-running) job.

Thanks for the comments Frank. I did not realize that the SC1435
wasn't suitable for HPC. I know it is one of the lower end systems
without remote management, hot-swappable hardware, etc. (but we don't
really need the frills) but I was under the impression that this model
is fairly common in other HPC installations. Maybe we were wrong, in
hindsight.

>
> The DELL support is a bit tricky.  We have Silver or Gold support for
> most systems, I don't know how they work for lower levels.  I can't
> complain about Gold.  For Silver they always try to make us do stuff
> like cross testing memory, CPU or other things.  (The most interesting
> request is to do a BIOS update to cure an (obviously) memory problem.
> The machine ran fine for 2 years with the old BIOS -- memory combination
> and suddenly it complains about it?) While I really like to play such
> hardware games I just don't have the time for it.  If you keep refusing
> these requests, eventually they give up and send a technician to replace
> different pieces of hardware.

I ought to check whether we are "Gold" or "Silver" or neither. Yes, I am
familiar with the BIOS update gig. I can almost quote their debug
checklist from memory. They made me confirm and update BIOSes too. It
was funny, especially since it had not even been a month since we
bought them, but the tech insisted our BIOS was *not* up-to-date back
then. We fixed it, but I always wonder why they do not just ship
machines with up-to-date versions of the BIOS!


>
> We use CentOS for most installation and DELL support never complained
> about it.  And IMHO the OS should not be able to cause an error detected
> by the management board.

Exactly my opinion. It clearly seems to be a hardware-level fault, and
the OS angle seems mostly smoke-and-mirrors to me. If it were a simple
software crash, I cannot explain why the system would not reboot when
the reboot button is pressed.


>
> I have dset reports in place, before calling support, because they
> always request them.  That speeds up chit-chat a bit.

Yes, dset and sosreports seem to be standard requests.

>
> That's another problem: IMHO your university should have a dedicated guy
> taking care of the computer systems, someone who has the time to deal with
> DELL support and so on.  23 machines don't amount to a full-time job, but
> maybe someone who's already taking care of some other Linux installation
> could do it.  It's not a good idea to have just some grad student doing that
> job part-time (no offense).  I know that reality looks bad.

Ah well, one does what one needs to! :) These are dedicated research
machines for our computational chemistry group so they will be running
code that eventually (hopefully!) puts results into my PhD thesis! :)
Most parts of system administration are fun except maybe having to
deal with stubborn vendors!

-- 
Rahul



