[Beowulf] Re: recommendation on crash cart for a cluster
room: full cluster KVM is not an option I suppose?
rpnabar at gmail.com
Fri Oct 9 10:17:59 PDT 2009
On Thu, Oct 8, 2009 at 5:55 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> 1) Console logging. Your machine just crashed. No clue in
> /var/log/messages. "I wonder if it printed something on the console?"
> Answer: ipmi and conman (available in an rpm in Red Hat distros).
I was "planning" on using kdump and a crash kernel for that. Note the
emphasis on "planning". I never got that working correctly. I got
started on kdump+kexec when I hit exactly the same "node crashes for
unknown reasons and I have no output" problem.
Maybe IPMI gives you the same functionality. Interesting point for me
though: what are the pros and cons of IPMI console logging versus kdump
in such crash scenarios? Are they competitors? Is one better / easier
than the other?
> 2) Monitoring. Temp, fan speeds, power supply state, events. Answers
> the "why is the little red light on the front of the case lit?"
> question. You can get some of this via other software (lm_sensors),
> but I find ipmitool to suck less, and ipmitool accurately answers the
> red light question -- lm_sensors can only guess.
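For anyone wanting to try the ipmitool route alongside an existing Nagios setup, here is a rough sketch of a temperature check. It assumes `ipmitool sdr type Temperature` prints pipe-separated rows like the ones in SAMPLE; the sensor names, the sample rows, and the 45/60 C thresholds are made up for illustration and will differ per vendor and BMC.

```python
import subprocess
import sys
from typing import List, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Illustrative output only (assumed format; real sensor names vary by vendor).
SAMPLE = """\
Ambient Temp     | 0Eh | ok  |  7.1 | 23 degrees C
CPU1 Temp        | 01h | ok  |  3.1 | 48 degrees C
CPU2 Temp        | 02h | ns  |  3.2 | No Reading
"""


def parse_temperatures(sdr_text: str) -> List[Tuple[str, float]]:
    """Extract (sensor name, degrees C) pairs from pipe-separated SDR rows."""
    readings = []
    for line in sdr_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # not a sensor row
        name, _sensor_id, status, _entity, reading = fields
        # Skip sensors that report no reading or a non-Celsius value.
        if status == "ns" or not reading.endswith("degrees C"):
            continue
        try:
            readings.append((name, float(reading.split()[0])))
        except ValueError:
            continue
    return readings


def nagios_status(readings: List[Tuple[str, float]],
                  warn_c: float, crit_c: float) -> Tuple[int, str]:
    """Map the hottest sensor to a Nagios exit code and status line."""
    if not readings:
        return UNKNOWN, "TEMP UNKNOWN - no readings"
    name, hottest = max(readings, key=lambda r: r[1])
    if hottest >= crit_c:
        code, label = CRITICAL, "CRITICAL"
    elif hottest >= warn_c:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"TEMP {label} - {name} = {hottest:.0f} C"


if __name__ == "__main__":
    # Needs ipmitool installed and access to the local BMC.
    try:
        out = subprocess.run(["ipmitool", "sdr", "type", "Temperature"],
                             capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"TEMP UNKNOWN - ipmitool failed: {exc}")
        sys.exit(UNKNOWN)
    code, message = nagios_status(parse_temperatures(out), warn_c=45, crit_c=60)
    print(message)
    sys.exit(code)
```

Parsing is kept separate from the ipmitool call so the logic can be tried against saved output on a machine without a BMC.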
I see. Yes, you read me correctly: I was putting full faith in
lm_sensors to do this. Currently I have lm_sensors feeding
temperatures to my Nagios monitoring setup, and it has been working fine.
But I didn't grasp the practical point about lm_sensors sucking more
than IPMI. That's interesting again: aren't they taking data from the
same bus or counters? Or is this because the sensor details tend to be
proprietary so lm_sensors lags behind the Vendor implementations of
Because if open-source IPMI is also trying to log sensor stats, it's in
competition with open-source lm_sensors (not to say this is bad or
unheard of for multiple open source projects getting the same thing