[Beowulf] Memory Testing?
Mark Hahn
hahn at mcmaster.ca
Sat Aug 13 19:22:52 PDT 2011
> I'm curious if anyone has any experience with ECC uncorrectable errors
> (specifically not the identification of), but which specific dimm in
> the chassis it's pointing to.
we've had good luck using EDAC to pin down bad dimms -
at least those that cause _correctable_ errors.
our uncorrectable errors trigger panics, though that's selectable:
I guess you could turn it off (/sys/module/edac_mc/panic_on_ue).
> The mcelog in linux doesn't seem to report the dimm slot correctly on
> my supermicro boards.
I prefer the hardware-topology-based naming that edac uses
(controller, channel, chipselect). I guess recent versions of edac
have a user-space tool that will translate that for you (but of course,
you have to verify the topo-to-label mapping yourself anyway.)
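a minimal sketch of reading the per-channel correctable-error counters by topology, assuming the older csrow-style EDAC sysfs layout under /sys/devices/system/edac/mc (mcN/csrowM/chK_ce_count and chK_dimm_label). newer kernels may expose a different layout, and the labels are empty unless something has populated them. the function name scan_edac and the output format are made up for illustration.

```python
import glob
import os
import re

# Assumed location of the EDAC memory-controller tree (legacy csrow layout).
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def scan_edac(root=EDAC_ROOT):
    """Collect (controller, csrow, channel, ce_count, label) tuples.

    Walks mc*/csrow*/ch*_ce_count under `root`. The label comes from the
    matching ch*_dimm_label file and is "" when missing or empty.
    """
    rows = []
    for csrow_dir in sorted(glob.glob(os.path.join(root, "mc*", "csrow*"))):
        mc = os.path.basename(os.path.dirname(csrow_dir))
        csrow = os.path.basename(csrow_dir)
        for ce_path in sorted(glob.glob(os.path.join(csrow_dir, "ch*_ce_count"))):
            m = re.match(r"ch(\d+)_ce_count$", os.path.basename(ce_path))
            if not m:
                continue
            ch = m.group(1)
            count = int(read_text(ce_path) or 0)
            label = read_text(os.path.join(csrow_dir, f"ch{ch}_dimm_label")) or ""
            rows.append((mc, csrow, int(ch), count, label))
    return rows


if __name__ == "__main__":
    # Print only the channels that have logged correctable errors.
    for mc, csrow, ch, ce, label in scan_edac():
        if ce:
            print(f"{mc} {csrow} ch{ch}: {ce} CE (label: {label or 'unset'})")
```

the label printed here is only as good as the topo-to-label mapping, so it still needs checking against the physical slots on the board.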
regards, mark hahn.