[Beowulf] Logging MCE information on next warm boot?
chris at csamuel.org
Mon Jan 25 16:48:50 PST 2010
Apologies for the personal copy but emails to the list from my new address are
being moderated and I suspect the moderator is away at present.
On Tue, 26 Jan 2010 05:46:31 am David Mathog wrote:
> Is it possible to have the Machine Check Exception (MCE) information
> saved to disk automatically on the next warm boot?
Depending on your kernel version, it may well do that by default. For instance,
both 2.6.20 and 2.6.28 (picked at random from git) say:
    /* Log the machine checks left over from the previous reset.
       This also clears all registers */
    do_machine_check(NULL, mce_bootlog ? -1 : -2);
Greg mentions mcelog. That will write its output to a file, but if the data
doesn't make it to spinning rust before the machine locks up then you're out
of luck, as it'll have cleared the MCE log as part of its action. :-(
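To illustrate the "make it to spinning rust" point: a minimal C sketch of appending a record and forcing it out with fsync() before doing anything else. This is not how mcelog itself works; append_durable() is a hypothetical helper, and even fsync() cannot help if the machine dies before the call returns or if the drive's own write cache loses the data.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append text to path and ask the kernel to flush it to stable storage.
 * Returns 0 on success, -1 on any error (including an interrupted write). */
int append_durable(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    size_t left = strlen(text);
    while (left > 0) {
        ssize_t n = write(fd, text, left);   /* may be a short write */
        if (n < 0) {
            close(fd);
            return -1;
        }
        text += n;
        left -= (size_t)n;
    }

    int rc = fsync(fd);                      /* flush data and metadata */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A logger would call something like append_durable("/var/log/mce-saved.log", line) for each record (the path is only an example) and treat a non-zero return as "this record may not have been saved".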
There is also parsemce by Dave Jones (source linked below); apparently you can
pass it some of the parameters you get. For instance, for your error I get:
$ ./parsemce -e 0000000000000007 -b 2 -a 00000000001511C0 -s 940040000000017A
Status: (7) Machine Check in progress.
Error IP valid
Restart IP valid.
parsebank(2): 940040000000017a @ 1511c0
External tag parity error
Correctable ECC error
Address in addr register valid
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O
IIRC that means that you took a machine check whilst there was already an MCE
in progress; that becomes an uncorrectable error and the box will die.
parsemce source: http://www.codemonkey.org.uk/projects/parsemce/parsemce.c
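If you want to check the top-level flags by hand, here is a minimal C sketch. It assumes the -e value given to parsemce is the global status register (MCG_STATUS) and the -s value is the bank's status register (MCi_STATUS), and it only looks at the architecturally defined bits. The macro names follow the kernel's naming; struct flag, print_flags(), mca_error_code() and the MCE_FLAGS_DEMO guard are made up for illustration. It makes no attempt at the vendor-specific decoding that parsemce does.

```c
#include <stdint.h>
#include <stdio.h>

/* Global machine-check status (MCG_STATUS) flags. */
#define MCG_STATUS_RIPV  (1ULL << 0)   /* restart IP valid */
#define MCG_STATUS_EIPV  (1ULL << 1)   /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)   /* machine check in progress */

/* Architectural flags in the top bits of a bank's MCi_STATUS. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_MISCV (1ULL << 59)  /* MCi_MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */

struct flag { uint64_t bit; const char *name; };   /* hypothetical helper type */

static const struct flag mcg_flags[] = {
    { MCG_STATUS_RIPV, "RIPV" }, { MCG_STATUS_EIPV, "EIPV" },
    { MCG_STATUS_MCIP, "MCIP" },
};

static const struct flag mci_flags[] = {
    { MCI_STATUS_VAL, "VAL" },     { MCI_STATUS_OVER, "OVER" },
    { MCI_STATUS_UC, "UC" },       { MCI_STATUS_EN, "EN" },
    { MCI_STATUS_MISCV, "MISCV" }, { MCI_STATUS_ADDRV, "ADDRV" },
    { MCI_STATUS_PCC, "PCC" },
};

/* The low 16 bits of MCi_STATUS hold the MCA error code. */
unsigned mca_error_code(uint64_t mci_status)
{
    return (unsigned)(mci_status & 0xFFFFu);
}

/* Print the name of every flag in the table that is set in value. */
void print_flags(FILE *out, uint64_t value, const struct flag *flags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (value & flags[i].bit)
            fprintf(out, " %s", flags[i].name);
    fputc('\n', out);
}

#ifdef MCE_FLAGS_DEMO
int main(void)
{
    const uint64_t mcg = 0x0000000000000007ULL;   /* the -e value above */
    const uint64_t mci = 0x940040000000017AULL;   /* the -s value above */

    fputs("MCG_STATUS:", stdout);
    print_flags(stdout, mcg, mcg_flags, sizeof mcg_flags / sizeof mcg_flags[0]);
    fputs("MCi_STATUS:", stdout);
    print_flags(stdout, mci, mci_flags, sizeof mci_flags / sizeof mci_flags[0]);
    printf("MCA error code: 0x%04x\n", mca_error_code(mci));
    return 0;
}
#endif
```

Built with -DMCE_FLAGS_DEMO, this should report RIPV, EIPV and MCIP for the -e value, and VAL, EN and ADDRV for the -s value with error code 0x017a, which lines up with the parsemce output above.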
If you can upgrade to a current kernel (2.6.3x) you can enable the new EDAC
code which will decode MCEs in the kernel and process/log them there which
might yield better information for you (and might even make it to a remote
syslog if they don't make it to the local platters).
Best of luck!
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP