[Beowulf] Barcelona hardware error: how to detect
Greg Lindahl
lindahl at pbm.com
Thu Jun 5 11:30:20 PDT 2008
On Thu, Jun 05, 2008 at 10:09:58PM +0400, Mikhail Kuzminsky wrote:
> This was also interesting to me, because I
> have no information on how this hardware problem shows up in
> "real life".
I have 4 chips with the bug, in 2 servers. I see about 1 lockup per
month with my workload, which doesn't include any VMs. (VMs are
reputed to trigger the bug quickly.) I found a webpage with the
details, and indeed this is what I see:
| The system may experience a machine check event reporting an L3
| protocol error has occurred. In this case, the MC4 status register
| (MSR 0000_0410) will be equal to B2000000_000B0C0F or
| BA000000_000B0C0F. The MC4 address register (MSR 0000_0412) will be
| equal to 26h.
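
If you have the two register values in hand (from a machine-check log, say), the comparison described above can be sketched roughly as follows. This is only an illustration: the names matches_l3_signature and read_msr are made up for this example, and read_msr assumes a Linux box with the msr driver loaded and root access, which may not match your setup. Only the MSR numbers and the expected values come from the quoted text.

```python
import os
import struct

# MSR numbers and expected values taken from the quoted description.
MC4_STATUS_MSR = 0x0410   # MSR 0000_0410
MC4_ADDR_MSR = 0x0412     # MSR 0000_0412
L3_ERROR_STATUS = {0xB2000000_000B0C0F, 0xBA000000_000B0C0F}
L3_ERROR_ADDR = 0x26      # 26h


def matches_l3_signature(mc4_status, mc4_addr):
    """Return True if the pair of values matches the quoted signature."""
    return mc4_status in L3_ERROR_STATUS and mc4_addr == L3_ERROR_ADDR


def read_msr(cpu, msr):
    """Read one 64-bit MSR on the given CPU.

    Assumption: Linux with the msr driver loaded, run as root, so that
    /dev/cpu/<cpu>/msr exists. The MSR number is used as the file offset.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical usage: check CPU 0 only.
    status = read_msr(0, MC4_STATUS_MSR)
    addr = read_msr(0, MC4_ADDR_MSR)
    print("L3 protocol error signature:", matches_l3_signature(status, addr))
```

The check itself is kept separate from the register read so it can also be applied to values copied out of a log by hand.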
-- greg