[Beowulf] Memory Testing?

Mark Hahn hahn at mcmaster.ca
Sat Aug 13 19:22:52 PDT 2011

> I'm curious if anyone has any experience with ECC uncorrectable errors
> (specifically not the identification of), but which specific dimm in
> the chassis it's pointing to.

we've had good luck using EDAC to pin down bad dimms -
at least those that cause _correctable_ errors.
our uncorrectable errors trigger panics.  I suppose that's selectable,
so I guess you could turn it off (/sys/module/edac_mc/panic_on_ue)
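
for what it's worth, here is a rough sketch of how one might read the
per-channel correctable-error counters that EDAC exports.  it assumes the
older csrow-style sysfs layout under /sys/devices/system/edac/mc
(mcN/csrowM/chK_ce_count); the function names are made up for illustration,
and the layout can differ between kernels and drivers.

```python
import glob
import os
import re

# Assumed location of the csrow-style EDAC sysfs tree; adjust if your
# kernel or driver lays it out differently.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Return {(mc, csrow, channel): correctable_error_count}."""
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in glob.glob(pattern):
        m = re.search(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$", path)
        if not m:
            continue
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            # Skip counters that are unreadable or not plain integers.
            continue
        counts[tuple(int(g) for g in m.groups())] = value
    return counts


def suspects(counts):
    """Locations with at least one correctable error, worst first."""
    hits = [(loc, n) for loc, n in counts.items() if n > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for (mc, csrow, ch), n in suspects(read_ce_counts()):
        print(f"mc{mc} csrow{csrow} channel{ch}: {n} correctable errors")
```

run across a cluster, something like this only gives you the
(controller, csrow, channel) triple; which physical slot that is still has
to be checked per board.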

> The mcelog in linux doesn't seem to report the dimm slot correctly on
> my supermicro boards.

I prefer the hardware-topology-based naming that edac uses
(controller, channel, chipselect).  I guess recent versions of edac
have a user-space tool that will translate that for you (but of course,
you have to verify the topo-to-label mapping yourself anyway.)
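
as a sketch of what that translation amounts to: a hand-verified table
keyed by (controller, csrow, channel).  the slot labels below are
placeholders, not the mapping for any real board, and label_for is a
hypothetical helper rather than part of any edac tool.

```python
# Hypothetical table: (controller, csrow, channel) -> silkscreen label.
# Every entry here is a placeholder; fill it in for your own board only
# after confirming each slot by hand.
SLOT_LABELS = {
    (0, 0, 0): "DIMM_A1",
    (0, 0, 1): "DIMM_B1",
    (0, 1, 0): "DIMM_A2",
    (0, 1, 1): "DIMM_B2",
}


def label_for(mc, csrow, channel, table=SLOT_LABELS):
    """Return the verified slot label, or flag the location as unverified."""
    key = (mc, csrow, channel)
    return table.get(key, f"unverified mc{mc}/csrow{csrow}/ch{channel}")


if __name__ == "__main__":
    print(label_for(0, 1, 0))
    print(label_for(1, 0, 0))
```

falling back to the raw topology name for unknown entries keeps an
unverified location from being reported under a guessed label.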

regards, mark hahn.

