[Beowulf] mcelog output, interpretation?

David Mathog mathog at caltech.edu
Mon Aug 18 14:04:38 PDT 2008


Finally got around to running mcelog on a pair of IBM System x3455
machines which had occasional "machine check logged" lines in
/var/log/messages.  One of them had 29 machine checks logged, all of
them variants of this:

MCE 27
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC cb4140bc52cd6
MISC c0080ee200000000 ADDR 50de6e7a0 
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = f9b2
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9c594001f9080813 MCGSTATUS 0

These had built up at about 1 per month over the last couple of years.
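
To see whether the logged records all point at the same place, a short script can tally them by CPU, address, and syndrome. This is only a sketch. The names parse_records and tally are made up for illustration, the patterns are inferred from the single record quoted above, and SAMPLE holds just that one record (abbreviated), not the full log.

```python
import re
from collections import Counter

# The one record quoted above, abbreviated. A real run would read the
# complete mcelog output instead of this sample.
SAMPLE = """\
MCE 27
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC cb4140bc52cd6
MISC c0080ee200000000 ADDR 50de6e7a0
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = f9b2
STATUS 9c594001f9080813 MCGSTATUS 0
"""


def parse_records(text):
    """Split mcelog text on 'MCE <n>' header lines and pull out a few fields."""
    records = []
    current = None
    for line in text.splitlines():
        if re.match(r"MCE \d+", line):
            # A new record starts here.
            current = {"cpu": None, "addr": None, "syndrome": None}
            records.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first header
        m = re.match(r"CPU (\d+)", line)
        if m:
            current["cpu"] = int(m.group(1))
        m = re.search(r"ADDR ([0-9a-fA-F]+)", line)
        if m:
            current["addr"] = m.group(1).lower()
        m = re.search(r"syndrome = ([0-9a-fA-F]+)", line)
        if m:
            current["syndrome"] = m.group(1).lower()
    return records


def tally(records):
    """Count records sharing the same (cpu, addr, syndrome) triple."""
    return Counter((r["cpu"], r["addr"], r["syndrome"]) for r in records)


if __name__ == "__main__":
    for key, count in tally(parse_records(SAMPLE)).most_common():
        print(count, key)
```

If most of the records share one address or one syndrome, that would suggest a single repeating fault rather than scattered ones.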

There seems to be an issue with the Northbridge, but exactly what that
is, and how serious it might be, is not greatly illuminated (at least
for me) by this information.  If this indicates an occasional glitch
with the built-in video, then it can be ignored.  If it indicates some
issue with the handling of external memory, that is of greater concern.
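
As a cross-check on what mcelog printed, the STATUS word can be pulled apart by hand. In this sketch, bits 32, 46 and 59 are taken from the mcelog lines above. The positions used for the uncorrected-error flag, the syndrome bytes, and the error codes are assumptions based on the AMD K8 northbridge status layout and should be checked against AMD's documentation. The helper names are hypothetical.

```python
# STATUS value copied from the mcelog record above.
STATUS = 0x9c594001f9080813


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_status(status):
    """Extract a few fields from a northbridge machine-check STATUS word."""
    return {
        # These three bit positions are the ones mcelog itself names.
        "err_cpu0": bit(status, 32),
        "corrected_ecc": bit(status, 46),
        "misc_valid": bit(status, 59),
        # Assumption: bit 61 flags an uncorrected error.
        "uncorrected": bit(status, 61),
        # Assumption: syndrome high byte in bits 31:24, low byte in bits 54:47.
        "syndrome": (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF),
        # Assumption: extended error code in bits 19:16, error code in bits 15:0.
        "ext_error_code": (status >> 16) & 0xF,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    fields = decode_status(STATUS)
    for name in ("err_cpu0", "corrected_ecc", "misc_valid", "uncorrected"):
        print(name, fields[name])
    for name in ("syndrome", "ext_error_code", "error_code"):
        print(name, hex(fields[name]))
```

Under those assumptions the syndrome comes out as f9b2, which matches the "Chipkill ECC syndrome" line, and the uncorrected flag is clear, which is consistent with mcelog reporting a corrected ECC error.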

The other machine had no mcelog output at all.  The MCE logs there were
cleared in the BIOS a couple of months ago when a defective memory stick
was swapped out.  The machines have different BIOS versions.  The one
with these messages (which does NOT support CPU frequency adjustment)
has the newer BIOS, which is:

	Version: IBM BIOS Version 1.35-[C0E135AUS-1.35]-
	Release Date: 02/26/2007

and the one which has not logged any of these (at least lately, and DOES
support CPU frequency adjustment) is:

	Version: IBM BIOS Version 1.28-[C0E128AUS-1.28]-
	Release Date: 09/14/2006

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


