[Beowulf] PowerEdge SC 1435: Unexplained Crashes.

John Hearns hearnsj at googlemail.com
Thu Oct 9 22:54:03 PDT 2008


2008/10/9 Rahul Nabar <rpnabar at gmail.com>

> I
>
> I posted this problem on a PowerEdge mailing list but haven't gotten
> very far yet. Any suggestions are appreciated!
>
Yes.

(1) Tell your Dell salesman that you have asked for help on this problem on
a public mailing list for High Performance Computing. Tell him/her that you
need high level Dell support on this. There are Dell customers on this list.

(2) Suspect the RAM. Ask some serious questions of your Dell support about
RAM compatibility - HPC applications stress the RAM. Ask, and ask again, if
the specific RAM chips you have are certified for that motherboard. Use
dmidecode to read out the manufacturer codes of the RAM modules - do you
have a mix of manufacturers?
Ask, and ask again, whether BIOS updates are available for these machines.
We once had a case with HP machines: even though the BIOS version was the
same on all 200 machines, there were differences between them, and it
turned out you had to go as far as checking the build date.
Get the very latest BIOS version you can.
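A minimal sketch of the dmidecode and BIOS checks in (2), assuming Python 3.7+
and the plain-text layout that `dmidecode --type memory` and
`dmidecode --type bios` print (unindented section titles followed by
tab-indented "Key: Value" lines). The function names are hypothetical. It
counts the Manufacturer field of populated "Memory Device" entries and reads
Version and Release Date from "BIOS Information". Treating Release Date as the
"build date" mentioned above is an assumption.

```python
"""Summarise DIMM manufacturers and BIOS identity from dmidecode output.

Sketch only: assumes the plain-text layout of `dmidecode --type memory` and
`dmidecode --type bios` (unindented section titles, tab-indented "Key: Value").
"""
import subprocess
from collections import Counter


def parse_sections(text, title):
    """Return one {key: value} dict per section whose title line equals `title`."""
    sections, current = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: a "Handle ..." header or a section title.
            current = {} if line.strip() == title else None
            if current is not None:
                sections.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return sections


def dimm_manufacturers(memory_text):
    """Count the Manufacturer field over populated DIMM slots only."""
    counts = Counter()
    for dev in parse_sections(memory_text, "Memory Device"):
        if dev.get("Size", "").startswith("No Module"):
            continue  # empty slot, e.g. "Size: No Module Installed"
        counts[dev.get("Manufacturer", "Unknown")] += 1
    return counts


def bios_identity(bios_text):
    """Return (vendor, version, release date) from the BIOS Information section."""
    found = parse_sections(bios_text, "BIOS Information")
    info = found[0] if found else {}
    return tuple(info.get(k, "?") for k in ("Vendor", "Version", "Release Date"))


def run_dmidecode(kind):
    """Run `dmidecode --type <kind>` (needs root) and return its stdout."""
    return subprocess.run(["dmidecode", "--type", kind],
                          capture_output=True, text=True, check=True).stdout


def report():
    """Print this node's BIOS identity and DIMM makers; warn on a mix."""
    makers = dimm_manufacturers(run_dmidecode("memory"))
    print("BIOS:", *bios_identity(run_dmidecode("bios")))
    print("DIMM manufacturers:", dict(makers))
    if len(makers) > 1:
        print("WARNING: mixed DIMM manufacturers on this node")
```

Call report() as root on every node and compare the results across the
cluster. Nodes with a mix of DIMM makers, or a BIOS date that differs from
the rest, are the ones to raise with Dell.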

(3) The RAM will most likely be the problem. But keep notes: if specific
machines crash more often than others, point this out to Dell, and consider
whether the PSUs in those machines are weak.
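Keeping the notes in a fixed format makes it easy to see which machines crash
more often than others. This sketch assumes a hypothetical hand-kept log with
one line per crash, "<date> <hostname> [note]", where '#' starts a comment.
The function names and the 2x-the-mean threshold are illustrative choices, not
anything Dell prescribes. Pass every node in all_hosts, including those that
never crashed. Otherwise the mean is taken over crashing nodes only.

```python
"""Tally crash notes per node and flag nodes that crash more than the rest.

Sketch only: assumes a hand-kept log with one crash per line, formatted as
"<date> <hostname> [free-text note]", with '#' starting comment lines.
"""
from collections import Counter
from statistics import mean


def tally_crashes(log_lines):
    """Count crashes per hostname (the second whitespace-separated field)."""
    counts = Counter()
    for line in log_lines:
        if line.lstrip().startswith("#"):
            continue  # comment line
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


def repeat_offenders(counts, all_hosts, factor=2.0):
    """Hosts whose crash count is at least `factor` times the cluster mean.

    `all_hosts` should include nodes that never crashed. The default factor
    of 2.0 is an arbitrary, illustrative threshold.
    """
    if not all_hosts:
        return []
    avg = mean(counts.get(h, 0) for h in all_hosts)
    if avg == 0:
        return []  # no crashes recorded at all
    return sorted(h for h in all_hosts if counts.get(h, 0) >= factor * avg)
```

The flagged hosts are the ones to point out to Dell, and the first places to
look at the PSUs.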