[Beowulf] Re: PowerEdge SC 1435: Unexplained Crashes.
Rahul Nabar rpnabar at gmail.com
Fri Oct 10 08:05:34 PDT 2008
>(1) Tell your Dell salesman that you have asked for help on this problem on
>a public mailing list for High Performance Computing. Tell him/her that you
>need high level Dell support on this. There are Dell customers on this list.

Thanks John. I will do that. A question: how likely is it that this is a software issue and not hardware from my symptoms? They keep harping on the fact that I am running a non-validated OS. We used to run Fedora. Now run CentOS. Same issues. They only support RedHat. I have a hard time being 100% certain but the more I see it the more I am convinced it is the hardware.

>(2) Suspect the RAM. Ask some serious questions of your Dell support about
>RAM compatibility - HPC applications stress the RAM. Ask, and ask again, if
>the specific RAM chips you have are certified for that motherboard. Use
>dmidecode to read out the manufacturer codes of the RAM modules - do you
>have a mix of manufacturers?

Very good idea. Never tried that. I will check. I assumed that they were all similar systems and I had compatible RAM since I bought it all packaged together.

>Ask and ask again about BIOS updates being available for these machines.
>We had a case once of HP machines - even though the BIOSes were versioned
>the same on 200 machines, there were some differences - turns out you had to
>go as far as checking the build date.
>Get the very latest BIOS version you can.

I have the latest. But that's only based on the version #. I will dig deeper. Could this be bad BIOS, though, from the symptoms? So, some code somewhere switches the state of that LED from blue to orange and if only I knew what the trigger was supposed to be. Someone had to write that!

>(3) The RAM will be the problem - but if you can keep notes and there are
>specific machines which crash more than others point this out to Dell and
>maybe suspect the PSUs being weak on those machines.

Yes. The crashes seem to be very clustered. We have had 5 specific machines out of 23 crash repeatedly. We swapped the motherboard+cpus on those and they do not seem to have crashed again as yet. But the time scale is only about 2 weeks. So I am not very confident of the statistical significance of my conclusions.

-- Rahul
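
For anyone wanting to try the dmidecode check suggested in (2), here is a minimal sketch rather than a tested tool. It assumes the usual `dmidecode --type memory` text layout ("Memory Device" blocks separated by blank lines, with `Size:` and `Manufacturer:` fields) and that empty slots report a size starting with "No Module". The function names are made up for illustration, and running dmidecode normally requires root.

```python
import re
import subprocess
from collections import Counter


def parse_memory_manufacturers(dmidecode_text):
    """Count DIMM manufacturer strings found in `dmidecode --type memory` output."""
    counts = Counter()
    # Assumption: dmidecode separates records with blank lines.
    for block in re.split(r"\n\s*\n", dmidecode_text):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip "Physical Memory Array" and other record types
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        # Assumption: empty slots report "No Module Installed" as their size.
        if fields.get("Size", "").startswith("No Module"):
            continue
        counts[fields.get("Manufacturer", "unknown")] += 1
    return counts


def has_mixed_manufacturers(counts):
    """True if more than one distinct manufacturer string was seen."""
    return len(counts) > 1


def read_local_memory_info():
    """Run dmidecode on the local node (needs root) and return its text output."""
    result = subprocess.run(
        ["dmidecode", "--type", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a node, something like `parse_memory_manufacturers(read_local_memory_info())` would give the per-manufacturer DIMM count, and comparing that across the crashing and non-crashing machines is the point of the exercise. For the BIOS question, `dmidecode -s bios-version` and `dmidecode -s bios-release-date` print the version string and the release date, which can be compared across nodes in the same way.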
