[Beowulf] mcelog output, interpretation?
David Mathog
mathog at caltech.edu
Mon Aug 18 14:04:38 PDT 2008
Finally got around to running mcelog on a pair of IBM System x3455
machines which had occasional "machine check logged" lines in
/var/log/messages. One of them had 29 machine checks logged, all of
them variants of this:
MCE 27
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC cb4140bc52cd6
MISC c0080ee200000000 ADDR 50de6e7a0
Northbridge RAM Chipkill ECC error
Chipkill ECC syndrome = f9b2
bit32 = err cpu0
bit46 = corrected ecc error
bit59 = misc error valid
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9c594001f9080813 MCGSTATUS 0
These had built up at about 1 per month over the last couple of years.
There seems to be an issue with the Northbridge, but exactly what that
is, and how serious it might be, is not greatly illuminated (at least
for me) by this information. If this indicates an occasional glitch
with the built-in video, then it can be ignored. If it indicates some
issue with the handling of external memory, that is of greater concern.
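As a cross-check on the decoded lines, the STATUS word from the report above can be picked apart by hand. The sketch below is only an illustration, not mcelog's own code. The bit positions are assumptions based on the usual layout of the AMD K8-family northbridge status register and should be checked against the processor documentation: bits 32, 46 and 59 as labelled in the output, the Chipkill syndrome split across bits 31:24 (high byte) and 54:47 (low byte), and a bus error code in the low 16 bits. All function and table names are made up for this example.

```python
# Sketch: decode the 64-bit STATUS value that mcelog printed in the report.
# Field positions are ASSUMED from the AMD K8 northbridge status layout;
# they are not taken from the original post and should be verified.

STATUS = 0x9c594001f9080813  # value from the "STATUS" line of the report

# Lookup tables for the low 16-bit bus error code (assumed encodings).
PARTICIPATION = ["local node origin", "local node response",
                 "local node observed", "generic"]
TIMEOUT = ["request didn't time out", "request timed out"]
TRANSACTION = ["generic error", "generic read", "generic write",
               "data read", "data write", "instruction fetch",
               "prefetch", "evict", "snoop"]
TARGET = ["memory access", "reserved", "i/o access", "generic"]
LEVEL = ["0", "1", "2", "generic"]


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def chipkill_syndrome(status):
    """Join the two syndrome bytes: bits 31:24 (high) and 54:47 (low)."""
    high = (status >> 24) & 0xFF
    low = (status >> 47) & 0xFF
    return (high << 8) | low


def decode_bus_error(status):
    """Split the low 16 bits, assumed to be a bus error code."""
    code = status & 0xFFFF
    rrrr = (code >> 4) & 0xF
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "timeout": TIMEOUT[(code >> 8) & 0x1],
        "transaction": TRANSACTION[rrrr] if rrrr < len(TRANSACTION) else "reserved",
        "target": TARGET[(code >> 2) & 0x3],
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    print("bit32 (err cpu0):            ", bit(STATUS, 32))
    print("bit46 (corrected ecc error): ", bit(STATUS, 46))
    print("bit59 (misc error valid):    ", bit(STATUS, 59))
    print("Chipkill ECC syndrome:       ", format(chipkill_syndrome(STATUS), "04x"))
    for name, text in decode_bus_error(STATUS).items():
        print(name + ":", text)
```

Under these assumptions the sketch reproduces the syndrome f9b2 and the "local node origin, generic read, memory access" wording that mcelog printed, which at least suggests the layout matches what mcelog used for this record.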
The other machine had no mcelog output at all. The MCE logs there were
cleared in the BIOS a couple of months ago when a defective memory stick
was swapped out. The machines have different BIOS versions; the one
with these messages (which does NOT support CPU frequency adjustment)
has the newer BIOS, which is:
Version: IBM BIOS Version 1.35-[C0E135AUS-1.35]-
Release Date: 02/26/2007
and the one which has not logged any of these (at least lately, and DOES
support CPU frequency adjustment) is:
Version: IBM BIOS Version 1.28-[C0E128AUS-1.28]-
Release Date: 09/14/2006
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech