[Beowulf] PowerEdge SC 1435: Unexplained Crashes.
rlinesseagate at gmail.com
Thu Oct 9 13:18:09 PDT 2008
On Thu, Oct 9, 2008 at 3:20 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> I have a PowerEdge SC 1435 that has a strange problem. We bought about 23 of
> these for a cluster and machines have been failing in a somewhat random manner
> in a peculiar way:
> (1) Screen is blank. Front blue indicator turns steady orange.
> (2) Cannot get it to reboot by pressing (or keeping depressed) the power button
> (3) only way to reboot is to cycle the power.
> (4) After reboot machine works fine again, till after a few days same failure.
> Ran the dset and diagnostic CD but nothing relevant.
> Any tips what could be the faulty component? Or debug ideas? Right now I'm
> totally lost! Hardware / software? CPU / Motherboard / Power supply?
Have you checked the baseboard management controller (BMC) log to see
if it is recording an error? Also check the temperature of the
machines. We have had some pretty weird issues with RAM and CPU
quirkiness when they reach a high internal temperature. If you can do
some polling using IPMI on the nodes to record the current temperature
and fan data over time, you could see what the readings were just
before a crash and might be able to tie it to an environmental problem.
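
Something along these lines could do the polling. It is only a rough
sketch: it assumes ipmitool is installed on each node and can reach the
local BMC, and that the output of `ipmitool sdr type ...` has five
pipe-separated columns (name, ID, status, entity, reading). The function
names, the log path and the 60-second interval are made up for
illustration, so adjust them to whatever your nodes actually report.

```python
import csv
import subprocess
import time
from datetime import datetime


def parse_sdr(output):
    """Parse `ipmitool sdr type ...` output into (sensor, status, reading) tuples."""
    rows = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Assumed layout: name | sensor id | status | entity id | reading
        if len(fields) == 5:
            rows.append((fields[0], fields[2], fields[4]))
    return rows


def read_sensors(sensor_type):
    """Query the local BMC for one sensor type, e.g. "Temperature" or "Fan"."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", sensor_type],
        capture_output=True, text=True, check=True,
    )
    return parse_sdr(result.stdout)


def poll(log_path, interval_s=60):
    """Append timestamped readings to a CSV file until interrupted."""
    while True:
        now = datetime.now().isoformat(timespec="seconds")
        # Reopen and close the file on every pass so that readings are
        # handed to the OS before the node has a chance to hang.
        with open(log_path, "a", newline="") as fh:
            writer = csv.writer(fh)
            for sensor_type in ("Temperature", "Fan"):
                for name, status, reading in read_sensors(sensor_type):
                    writer.writerow([now, sensor_type, name, status, reading])
        time.sleep(interval_s)
```

Calling poll("/var/tmp/ipmi_sensors.csv") on each node (the path is
just an example) leaves a trail you can read back after the next
failure. Writing the log to a network share may be safer, since a hard
power cycle can lose the last local writes. For the management log
itself, `ipmitool sel elist` prints the system event log.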
Hope this helps,