[Beowulf] Multisocket mainboard hardware problems

Chris Samuel csamuel at vpac.org
Thu Jan 15 18:54:41 PST 2009


----- "Francesco Pietra" <francesco.pietra at accademialucchese.it> wrote:

> Therefore, is there any software way to check if the CPUs are fully in
> order, including the memory controller? lshw and other software
> provided only partial help in my hands.

Make sure that you have ECC set to MAX in your BIOS. On our
SuperMicro mainboards that enables scrubbing of RAM and CPU
caches as well as reporting of ECC memory errors.

For some reason the SuperMicro BIOSes we've had recently
have defaulted to turning ECC off, which isn't particularly
useful, especially on motherboards that can only take ECC
memory!  We found that out the hard way recently. You can
check the current setting from the output of dmidecode (run
as root) like this:

dmidecode | grep -A7 "Physical Memory Array" | grep "Error Correction" | grep ECC
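
If you want to run the same check across a lot of nodes, here is a
rough sketch of it as a pair of shell functions. The function names
(ecc_types, ecc_enabled) are made up for illustration, and it assumes
dmidecode prints each "Physical Memory Array" record as a block that
ends with a blank line. Instead of relying on a fixed -A7 window it
prints the "Error Correction Type" value of each array, so a BIOS
that reports "None" shows up explicitly rather than as empty output.

```shell
#!/bin/sh
# Print the "Error Correction Type" value of every Physical Memory Array
# record found in dmidecode output supplied on stdin.
ecc_types() {
    awk '
        /^Physical Memory Array/ { in_array = 1; next }
        /^$/                     { in_array = 0 }
        in_array && /Error Correction Type:/ {
            sub(/^[ \t]*Error Correction Type:[ \t]*/, "")
            print
        }
    '
}

# Return 0 if every memory array reports some form of ECC,
# 1 if any array reports a non-ECC type (e.g. "None"),
# 2 if no Physical Memory Array record was found at all.
ecc_enabled() {
    types=$(ecc_types)
    [ -n "$types" ] || return 2
    if printf '%s\n' "$types" | grep -qv 'ECC'; then
        return 1
    fi
    return 0
}

# Typical use (as root):
#   dmidecode -t memory | ecc_enabled && echo "ECC is on"
```

The functions read from stdin so that they can also be fed saved
dmidecode output collected from other machines.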

Make sure you're also running mcelog to pull any MCE
or ECC hardware reports that the kernel has recorded
from the CPUs out to a logfile.

We find that running it with the --k8 and --dmi options
is important to decode more information about these events.
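
As a minimal sketch of what that can look like when run periodically
(from cron, say), something along these lines would do. The function
name collect_mce, the MCELOG override variable and the default log
path /var/log/mcelog are assumptions for illustration, not anything
mcelog itself defines, and --k8 only makes sense on AMD K8-family
CPUs like ours.

```shell
#!/bin/sh
# collect_mce [LOGFILE]
# Append whatever mcelog decodes from the kernel's machine-check buffer
# to LOGFILE (assumed default: /var/log/mcelog). Needs root to read
# the machine-check device. Set MCELOG to use a different mcelog binary.
collect_mce() {
    logfile=${1:-/var/log/mcelog}
    "${MCELOG:-mcelog}" --k8 --dmi >> "$logfile" 2>&1
}

# Typical use (as root), e.g. from an hourly cron job:
#   collect_mce /var/log/mcelog
```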

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


