[Beowulf] Re: RAM ECC errors (Henning Fehrmann)

Mark Hahn hahn at mcmaster.ca
Tue Feb 23 12:05:39 PST 2010

> No, but there seem to be a switch in the kernel module that allows to trigger
> a kernel panic upon discovering uncorrectable errors.

I suspect you mean /sys/module/edac_mc/panic_on_ue
(ue = uncorrected error).  I consider this very much the norm:
it would be very strange to run with ECC memory, and ECC enabled,
and not actually halt on UE.  UE represents a failure of the memory
system, not just a transient event, but something which must be 
physically fixed.  even for HA situations, I'd be pretty skeptical
about using a memory channel which had any UEs on it.

CEs (corrected errors), OTOH, are very different.  they're almost just
a heartbeat of your ECC subsystem.  yes, a CE indicates some event 
that needed correcting, but at a modest rate, CEs are acceptable.
there are failure modes, though, where enough CEs eventually cause 
a UE: tracking CE rate is important for that reason.  (other UE modes
don't have this warning sign...)
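
as a rough illustration of tracking the CE rate, here is a minimal Python
sketch.  it assumes the per-controller counters are exposed as
/sys/devices/system/edac/mc/mc*/ce_count and ue_count; check that layout
on your kernel.  the function names are made up for this example.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (ce_count, ue_count)} for each mc* directory under root.

    The default root is an assumption about the sysfs layout; pass another
    directory if your kernel exposes the counters elsewhere.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[os.path.basename(mc_dir)] = (ce, ue)
    return counts


def ce_rate_per_hour(before, after, elapsed_seconds):
    """Return {controller: corrected errors per hour} between two snapshots."""
    rates = {}
    for mc, (ce_after, _ue_after) in after.items():
        ce_before = before.get(mc, (0, 0))[0]
        rates[mc] = (ce_after - ce_before) * 3600.0 / elapsed_seconds
    return rates
```

take one snapshot with read_edac_counts(), take another some time later,
and feed both to ce_rate_per_hour().  what rate counts as "modest" is a
site decision; the sketch only reports the number.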

you can set CEs to log through kernel->syslog via edac tunables in /sys.
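
the exact location of these tunables seems to vary between kernel
versions, so the sketch below just tries a list of candidate paths and
reports the first one that can be read.  the first panic_on_ue path is
the one mentioned above; the edac_core paths are an assumption about
newer kernels and may not exist on yours.

```python
# Candidate locations for the tunables. Only the first PANIC_ON_UE entry
# comes from this post; the edac_core entries are assumptions.
PANIC_ON_UE_PATHS = [
    "/sys/module/edac_mc/panic_on_ue",
    "/sys/module/edac_core/parameters/edac_mc_panic_on_ue",
]
LOG_CE_PATHS = [
    "/sys/module/edac_core/parameters/edac_mc_log_ce",
]


def read_first_existing(paths):
    """Return (path, contents) for the first readable file in paths, else None."""
    for path in paths:
        try:
            with open(path) as f:
                return path, f.read().strip()
        except OSError:
            # Not present (or not readable) on this kernel; try the next one.
            continue
    return None
```

read_first_existing(PANIC_ON_UE_PATHS) returns None if no candidate
exists, which usually means the EDAC modules are not loaded or the
tunable lives somewhere else on that kernel.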

> Yes, but the memory of any process might get corrupted, thus this is more to

if UE is set to panic, nothing will get corrupted (that's really the point, eh?)

