[Beowulf] Multisocket mainboard hardware problems

Thomas Vixel tvixel at gmail.com
Fri Jan 16 09:52:39 PST 2009


It's somewhat of a stretch since you say it *suddenly* lost the use of
the bank of memory, but it could be that the processor for that
particular bank of memory isn't properly seated.

We've had two such systems in the past couple of years. With the
first, memtest86 kept reporting errors at consecutive addresses once
it crossed the memory boundary where the affected processor's memory
controller took over. I swapped out memory modules, fiddled with
memory settings, and rearranged the cards, all to no avail. What
finally fixed the problem was taking the processors out and seating
each in the other's socket.

At that point I had figured the processor itself was damaged, and that
I'd surely get errors on the other side of the memory region, but to
my surprise I did not.

The second system flat out refused to recognize one whole bank of
memory like yours. After swapping out the memory didn't work, I tried
the processor swapping trick and it worked perfectly afterward. Even
swapping them back so they were in their original arrangement worked
the second time.

So, if all else fails you may want to try swapping the processors
around or reseating them. It just might save you some headaches
dealing with SuperMicro's RMA & tech support departments.
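
Before pulling processors, it may be worth confirming from software
which socket's memory has gone missing. This is only a rough sketch,
assuming a Linux kernel with NUMA support that exposes per-node totals
under /sys/devices/system/node. The function name and its optional
directory argument are made up for illustration.

```shell
#!/bin/sh
# Print MemTotal for each NUMA node so a missing or short bank stands out.
# The optional first argument (an alternative sysfs root) is a
# hypothetical convenience for trying this against saved copies.
node_mem_totals() {
    root="${1:-/sys/devices/system/node}"
    for f in "$root"/node*/meminfo; do
        # Skip the unexpanded glob when no node directories exist.
        [ -r "$f" ] || continue
        # Lines look like: "Node 0 MemTotal:  8388608 kB"
        awk '$3 == "MemTotal:" { printf "node%s %s kB\n", $2, $4 }' "$f"
    done
}
```

Running `node_mem_totals` with no argument reads the live system. A
node that is absent, or that reports far less memory than its
neighbour, would point at the socket whose memory controller is not
seeing its DIMMs.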

On 1/15/09, Chris Samuel <csamuel at vpac.org> wrote:
>
> ----- "Francesco Pietra" <francesco.pietra at accademialucchese.it> wrote:
>
>> Therefore, is there any software way to check if the CPUs are fully
>> in order, including the memory controller? lshw and other software
>> provided only partial help in my hands.
>
> Make sure that you have ECC turned to MAX in your BIOS,
> on our SuperMicro mainboards that enables scrubs of RAM
> and CPU caches as well as spotting ECC memory errors.
>
> For some reason the SuperMicro BIOSes we've had recently
> have defaulted to turning ECC off, which isn't particularly
> useful, especially on motherboards that can only take ECC
> memory! We found that out the hard way recently, and you
> can work that out from the output of dmidecode like this:
>
> dmidecode | grep -A7 "Physical Memory Array" | grep "Error Correction" | grep ECC
>
> Make sure you're also running mcelog to pull any MCE
> or ECC hardware reports that the kernel has recorded
> from the CPUs out to a logfile.
>
> We find that running it with the --k8 and --dmi options
> is important to decode more information about these events.
>
> cheers!
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
>  The Victorian Partnership for Advanced Computing
>  P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
>
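
For checking a whole rack of nodes, the dmidecode test quoted above
could be wrapped so that it gives a yes/no answer. The following is a
minimal sketch, not anything from Chris's setup: the function names are
invented, and it assumes dmidecode prints an "Error Correction Type:"
line inside each "Physical Memory Array" block and reports "None" when
ECC is off.

```shell
#!/bin/sh
# Print the "Error Correction Type" of every Physical Memory Array
# found in dmidecode output supplied on stdin.
ecc_types() {
    awk '
        /^Physical Memory Array/ { in_array = 1; next }
        /^$/ { in_array = 0 }   # a blank line ends the current block
        in_array && /Error Correction Type:/ {
            sub(/^.*Error Correction Type: */, "")
            print
        }
    '
}

# Succeed only if at least one array was found and none reports "None".
ecc_enabled() {
    types=$(ecc_types)
    [ -n "$types" ] || return 1
    ! printf '%s\n' "$types" | grep -qx 'None'
}
```

As root, something like `dmidecode | ecc_enabled || echo "ECC looks
off on $(hostname)"` would then flag the boards whose BIOS has quietly
left ECC disabled.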


