[Beowulf] Re: PowerEdge SC 1435: Unexplained Crashes.
rpnabar at gmail.com
Fri Oct 10 07:54:35 PDT 2008
>Have you checked in the baseboard management log to see if it is
>throwing an error.
Apparently the SC1435 does not have OpenManage. "Simple Computing" is
too simple to warrant that, I was told. They do have DSET to look at
the ESM logs, but not for CentOS or Fedora. Red Hat is their
"validated" [sic] OS. That's the only one they support. So I'm sort of
> Also check on the temperature of the machines. We
>have had some pretty weird issues with RAM and CPU quirkiness when
>they reach a high internal temperature. If you can do some polling
>using IPMI on the nodes to record the current temp and fan data over
>time so that you could see what it was at just before a crash you
>might be able to point it to an environmental situation.
I'll try IPMI. I was trying lm_sensors but apparently it does not have
a driver for this chipset / motherboard combination. Not sure if it's
an AMD Opteron specific driver issue or a
vendor-not-releasing-motherboard-specs issue (heard both versions on
the net). Anybody else had success using lm_sensors on the SC1435?
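
The IPMI polling suggested above could look roughly like the sketch below.
It is a minimal sketch, not something verified on the SC1435: it assumes
ipmitool is installed, that the IPMI kernel drivers are loaded so the local
BMC is reachable, and that the BMC actually exposes temperature and fan
sensors. The log file name, the 60-second interval and the function names
are illustrative choices only.

```python
import os
import subprocess
import time
from datetime import datetime


def parse_sdr_line(line):
    """Split one row of 'ipmitool sdr type ...' output.

    Rows are expected to look like
        Ambient Temp     | 0Eh | ok  |  7.1 | 23 degrees C
    Returns (name, status, reading), or None if the row does not match.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5:
        return None
    name, _sensor_id, status, _entity_id, reading = fields
    return name, status, reading


def read_sensors(sensor_type):
    """Run ipmitool for one sensor type ('Temperature' or 'Fan')."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", sensor_type],
        capture_output=True, text=True, check=True,
    )
    rows = (parse_sdr_line(l) for l in result.stdout.splitlines())
    return [r for r in rows if r is not None]


def poll(logfile="ipmi_sensors.log", interval=60):
    """Append timestamped readings to logfile every `interval` seconds."""
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(logfile, "a") as log:
            for sensor_type in ("Temperature", "Fan"):
                for name, status, reading in read_sensors(sensor_type):
                    log.write(f"{stamp}\t{name}\t{status}\t{reading}\n")
            # Force the data to disk so the last readings survive a crash.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval)


if __name__ == "__main__":
    poll()
```

Writing the log to an NFS mount or another machine, rather than the local
disk of the node being watched, makes it more likely that the readings
taken just before a crash are still there afterwards.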