[Beowulf] Logging MCE information on next warm boot?

David Mathog mathog at caltech.edu
Mon Jan 25 10:46:31 PST 2010

Is it possible to have the Machine Check Exception (MCE) information
saved to disk automatically on the next warm boot?

Long form:

A K7 node crashed yesterday and left an MCE on the screen which I copied
down as:

CPU 0 machine check exception 0000000000000007
Bank 1 F000000000000853
Bank 2 940040000000017A at 00000000001511C0
Kernel panic, not syncing, Unable to Continue

Copying all of those numbers down by hand is very error-prone.  As I
understand it, the MCE values stay in the CPU's registers after the
crash and may be retrieved at the next warm boot (via a front panel
reset, for instance).  But this save does not seem to happen
automatically, or at least I could not find anything that looked like an
MCE dump in /var/log or /var/log/kernel when the system came up.  So I
want to set things up, if possible, to save this information to disk.
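
As a sanity check on hand-copied values like the ones above, the bank
status words can be decoded with a short script.  This is only a minimal
sketch: it assumes the generic x86 machine-check status layout (flag bits
63 down to 57, error code in the low 16 bits) applies to this K7, and the
names parse_panic and decode_status are made up for illustration.  It
does not read anything from the CPU; it only parses text typed in from
the screen.

```python
import re

# Flag bits of an x86 machine-check bank status register (assumed layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the bank's MISC register holds extra information
    58: "ADDRV",  # the bank's ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Matches lines such as "Bank 2 940040000000017A at 00000000001511C0".
BANK_RE = re.compile(
    r"Bank\s+(\d+)\s+([0-9A-Fa-f]{16})(?:\s+at\s+([0-9A-Fa-f]{16}))?"
)


def decode_status(status):
    """Split a 64-bit status word into flag names and the low 16-bit code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


def parse_panic(text):
    """Return {bank number: decoded info} for every 'Bank' line in text."""
    banks = {}
    for match in BANK_RE.finditer(text):
        info = decode_status(int(match.group(2), 16))
        # The address is only present when the panic line has an "at" part.
        info["address"] = int(match.group(3), 16) if match.group(3) else None
        banks[int(match.group(1))] = info
    return banks


if __name__ == "__main__":
    copied = (
        "Bank 1 F000000000000853\n"
        "Bank 2 940040000000017A at 00000000001511C0\n"
    )
    for bank, info in sorted(parse_panic(copied).items()):
        print(bank, info["flags"], hex(info["error_code"]), info["address"])
```

Run against the two bank lines copied above, this reports VAL, OVER, UC
and EN set for bank 1, and VAL, EN and ADDRV set for bank 2, which at
least agrees with only bank 2 having printed an address.  A mistyped
digit in the top byte would likely show up as an implausible flag
combination.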

For what it's worth, this is on a Tyan S2466.  On the next warm boot the
hardware monitor in the BIOS showed the CPU fan at full speed, but when
the OS came up lm_sensors showed it at half speed.  I have seen this
glitch before after other mysterious crashes, and the only way to clear
it seems to be to unplug the unit for 10 minutes, allowing time for the
errant bit to fade away.  This is on a kernel.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

