[Beowulf] Logging MCE information on next warm boot?

Chris Samuel chris at csamuel.org
Mon Jan 25 16:48:50 PST 2010


Hi David,

Apologies for the personal copy but emails to the list from my new address are 
being moderated and I suspect the moderator is away at present.

On Tue, 26 Jan 2010 05:46:31 am David Mathog wrote:

> Is it possible to have the Machine Check Exception (MCE) information
> saved to disk automatically on the next warm boot?

Depending on your kernel version it may well do that by default. For instance, 
both 2.6.20 and 2.6.28 (to pick at random from git) say:

        /* Log the machine checks left over from the previous reset.
           This also clears all registers */
        do_machine_check(NULL, mce_bootlog ? -1 : -2);


Greg mentions mcelog. That will write its output to a file, but if the data 
doesn't make it to spinning rust before the machine locks up then you're out 
of luck, as mcelog will have cleared the MCE log as part of reading it. :-(

There is also parsemce by Dave Jones [1]; apparently you can pass it some of 
the parameters you get. For instance, for your error I get:

$ ./parsemce -e 0000000000000007 -b 2 -a 00000000001511C0 -s 940040000000017A
Status: (7) Machine Check in progress.
Error IP valid
Restart IP valid.
parsebank(2): 940040000000017a @ 1511c0
        External tag parity error
        Correctable ECC error
        Address in addr register valid
        Error enabled in control register
        Memory heirarchy error
        Request: Generic error
        Transaction type : Generic
        Memory/IO : I/O

IIRC that means that you took a machine check whilst an MCE was already in 
progress, which becomes an uncorrectable error, and the box will die.
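
For what it's worth, the flag lines in the parsemce output above can be checked
by hand. Below is a minimal sketch in C, assuming the architectural x86 bit
layout for the global status register (RIPV, EIPV, MCIP in bits 0-2) and for
the per-bank status register (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits
63-57, error code in the low 16 bits). The macro and function names are made
up for illustration and are not taken from parsemce or the kernel, and the
model-specific bits that parsemce also decodes are left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Global machine-check status flags (low three bits). */
#define MCG_RIPV  (1ULL << 0)   /* restart IP valid */
#define MCG_EIPV  (1ULL << 1)   /* error IP valid */
#define MCG_MCIP  (1ULL << 2)   /* machine check in progress */

/* Per-bank status flags (top bits of the 64-bit register). */
#define MCI_VAL   (1ULL << 63)  /* bank holds a valid error */
#define MCI_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_UC    (1ULL << 61)  /* error was not corrected */
#define MCI_EN    (1ULL << 60)  /* error enabled in control register */
#define MCI_MISCV (1ULL << 59)  /* misc register valid */
#define MCI_ADDRV (1ULL << 58)  /* address register valid */
#define MCI_PCC   (1ULL << 57)  /* processor context corrupt */

/* Hypothetical helper type: one flag mask and its description. */
struct flag {
    uint64_t mask;
    const char *text;
};

/* Print a line for each flag set in value; return how many were set. */
static int print_set_flags(uint64_t value, const struct flag *flags, int n)
{
    int set = 0;
    for (int i = 0; i < n; i++) {
        if (value & flags[i].mask) {
            printf("  %s\n", flags[i].text);
            set++;
        }
    }
    return set;
}

/* Decode the global status value (the -e argument in the example above). */
static int decode_mcg_status(uint64_t mcg)
{
    static const struct flag flags[] = {
        { MCG_RIPV, "restart IP valid" },
        { MCG_EIPV, "error IP valid" },
        { MCG_MCIP, "machine check in progress" },
    };
    printf("MCG status %#llx:\n", (unsigned long long)mcg);
    return print_set_flags(mcg, flags, (int)(sizeof flags / sizeof flags[0]));
}

/* Decode the generic flags of one bank's status (the -s argument above). */
static int decode_bank_status(int bank, uint64_t status)
{
    static const struct flag flags[] = {
        { MCI_VAL,   "status valid" },
        { MCI_OVER,  "error overflow" },
        { MCI_UC,    "uncorrected error" },
        { MCI_EN,    "error enabled in control register" },
        { MCI_MISCV, "misc register valid" },
        { MCI_ADDRV, "address in addr register valid" },
        { MCI_PCC,   "processor context corrupt" },
    };
    printf("bank %d status %#llx:\n", bank, (unsigned long long)status);
    return print_set_flags(status, flags, (int)(sizeof flags / sizeof flags[0]));
}

/* The low 16 bits carry the error code that the rest of the decode keys on. */
static unsigned mca_error_code(uint64_t status)
{
    return (unsigned)(status & 0xFFFF);
}
```

Calling decode_mcg_status(0x7) and decode_bank_status(2, 0x940040000000017AULL)
with the values from your log reports all three global flags, and for the bank
the valid, enabled and address-valid flags, which agrees with the corresponding
lines parsemce printed.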

[1] - http://www.codemonkey.org.uk/projects/parsemce/parsemce.c

If you can upgrade to a current kernel (2.6.3x) you can enable the new EDAC 
code, which decodes MCEs in the kernel and processes/logs them there. That 
might yield better information for you (and it might even make it to a remote 
syslog if it doesn't make it to the local platters).

Best of luck!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP