[Beowulf] Barcelona hardware error: how to detect

Greg Lindahl lindahl at pbm.com
Thu Jun 5 11:30:20 PDT 2008


On Thu, Jun 05, 2008 at 10:09:58PM +0400, Mikhail Kuzminsky wrote:

> This was interesting for me also, because I
> have no information about how this hardware problem may show up in
> "real life".

I have 4 chips with the bug, in 2 servers. I see about 1 lockup per
month with my workload, which doesn't include any VMs. (VMs are
reputed to trigger the bug quickly.) I found a webpage with the
details, and indeed this is what I see:

| The system may experience a machine check event reporting an L3
| protocol error has occurred. In this case, the MC4 status register
| (MSR 0000_0410) will be equal to B2000000_000B0C0F or
| BA000000_000B0C0F. The MC4 address register (MSR 0000_0412) will be
| equal to 26h.
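
For anyone who wants to check for this signature mechanically, here is a
minimal sketch in Python. It only compares two register values against
the numbers quoted above. The function names are made up for
illustration, and the read_msr helper assumes a Linux box with the msr
kernel module loaded and root access (it reads /dev/cpu/N/msr); whether
the values are still present in the registers after a lockup and reboot
is not something the quoted text says, so in practice you may be feeding
it numbers copied from a machine check log instead.

```python
import os
import struct

# Values taken from the quoted description above.
MC4_STATUS_MSR = 0x0410
MC4_ADDR_MSR = 0x0412
L3_PROTOCOL_ERROR_STATUS = {0xB2000000000B0C0F, 0xBA000000000B0C0F}
L3_PROTOCOL_ERROR_ADDR = 0x26


def matches_l3_protocol_error(mc4_status: int, mc4_addr: int) -> bool:
    """Return True if the MC4 status/address pair matches the quoted signature."""
    return (mc4_status in L3_PROTOCOL_ERROR_STATUS
            and mc4_addr == L3_PROTOCOL_ERROR_ADDR)


def read_msr(cpu: int, reg: int) -> int:
    """Read one 64-bit MSR through the Linux msr driver.

    Assumption: /dev/cpu/<cpu>/msr exists (modprobe msr) and we run as root.
    The MSR number is used as the file offset.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, reg)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def cpu_shows_signature(cpu: int) -> bool:
    """Hypothetical convenience wrapper: read both MSRs and compare."""
    status = read_msr(cpu, MC4_STATUS_MSR)
    addr = read_msr(cpu, MC4_ADDR_MSR)
    return matches_l3_protocol_error(status, addr)
```

With values copied by hand from a log, only the first function is
needed, e.g. matches_l3_protocol_error(0xBA000000000B0C0F, 0x26).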

-- greg





