[Beowulf] Logging MCE information on next warm boot?
henning.fehrmann at aei.mpg.de
Mon Jan 25 23:58:40 PST 2010
On Mon, Jan 25, 2010 at 10:46:31AM -0800, David Mathog wrote:
> Is it possible to have the Machine Check Exception (MCE) information
> saved to disk automatically on the next warm boot?
> Long form:
> A K7 node crashed yesterday and left an MCE on the screen which I copied
> down as:
> CPU 0 machine check exception 0000000000000007
> Bank 1 F000000000000853
> Bank 2 940040000000017A at 00000000001511C0
> Kernel panic, not syncing, Unable to Continue
> Copying all of those numbers down is very error prone. As I understand
> it the MCE values stay in the registers of the CPU after the crash, and
> may be retrieved at the next warm boot (via a front panel reset, for
> instance). But this save seems not to happen automatically, or at least
> I could not find anything that looked like an MCE dump in /var/log or
> /var/log/kernel when the system came up. So I want to set things up, if
> possible to save this information to disk.
We loaded the netconsole module. This works at least for the
2.6.27 kernel. AFAIK, for older kernels one has to compile it into the kernel.
It sends printk messages to a remote syslog-ng server which collects
the information. I don't know how much netconsole sends in the case of a
netconsole needs parameters:
modprobe netconsole netconsole=own_port@own_ip/NIC,remote_port@remote_IP/remote_mac
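
As a rough sketch of how the fields fit together, the snippet below assembles the parameter string from shell variables and prints the resulting command. All addresses, ports, the interface name and the MAC are made-up placeholders, not values from our setup; the target port assumes the remote syslog-ng listens for UDP on 514. The command is only printed, since actually loading the module needs root.

```shell
#!/bin/sh
# All values below are hypothetical placeholders; substitute your own.
SRC_PORT=6665                 # local UDP port on the node (assumed)
SRC_IP=10.0.0.10              # the node's own IP address (assumed)
DEV=eth0                      # NIC the messages leave from (assumed)
DST_PORT=514                  # UDP port of the remote syslog server (assumed)
DST_IP=10.0.0.1               # IP of the remote log host (assumed)
DST_MAC=00:11:22:33:44:55     # MAC of the log host, or of the gateway (assumed)

# Format: own_port@own_ip/NIC,remote_port@remote_IP/remote_mac
NETCONSOLE_ARG="${SRC_PORT}@${SRC_IP}/${DEV},${DST_PORT}@${DST_IP}/${DST_MAC}"

# Print the command instead of running it; modprobe requires root.
echo "modprobe netconsole netconsole=${NETCONSOLE_ARG}"
```

If the log host is on a different subnet, the MAC should presumably be that of the gateway rather than of the log host itself.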