[Beowulf] Re: RAM ECC errors (Henning Fehrmann)

Tue Feb 23 23:30:31 PST 2010

Hi Mark,

On Tue, Feb 23, 2010 at 03:05:39PM -0500, Mark Hahn wrote:
> >No, but there seems to be a switch in the kernel module that allows triggering
> >a kernel panic upon discovering uncorrectable errors.
> 
> I suspect you mean /sys/module/edac_mc/panic_on_ue
> (ue = uncorrected error).  I consider this very much the norm:
> it would be very strange to run with ECC memory, and ECC enabled,
> and not actually halt on UE.  UE represents a failure of the memory
> system, not just a transient event, but something which must be
> physically fixed.  even for HA situations, I'd be pretty skeptical
> about using a memory channel which had any UE's on it.

Strangely enough, panic_on_ue is off by default.
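
For anyone who wants to check this on their own nodes, here is a minimal
sketch that reads the tunable. The path in DEFAULT_PARAM is an assumption
on my side: the location differs between kernel versions, so pass whatever
path your kernel actually exposes. The function name is made up for
illustration.

```python
from pathlib import Path

# Assumed location of the tunable; it varies between kernel versions,
# so treat this default as a placeholder and override it if needed.
DEFAULT_PARAM = Path("/sys/module/edac_core/parameters/edac_mc_panic_on_ue")


def panic_on_ue_enabled(param=DEFAULT_PARAM):
    """Return True or False for the tunable, or None if the file is absent."""
    try:
        value = Path(param).read_text().strip()
    except FileNotFoundError:
        return None
    # Integer module parameters read back as "0" or "1".
    return value not in ("0", "N", "n", "")


if __name__ == "__main__":
    state = panic_on_ue_enabled()
    if state is None:
        print("tunable not found; is the EDAC module loaded?")
    else:
        print("panic_on_ue is", "on" if state else "off")
```

Returning None for a missing file keeps "EDAC not loaded" distinct from
"loaded, but panic disabled", which matters when sweeping a whole cluster.
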
> 
> CE (corrected errors) OTOH, are very different.  they're almost just
> a heartbeat of your ECC subsystem.  yes, a CE indicates some event
> that needed correcting, but at a modest rate, CEs are acceptable.
> there are failure modes, though, where enough CEs eventually cause a
> UE: tracking CE rate is important for that reason.  (other UE modes
> don't have this warning sign...)

On some apparently broken hardware we see a rate of nearly one event
per second. I assume the probability of an uncorrectable error is a
few orders of magnitude smaller than the rate of correctable errors,
since more events have to occur simultaneously. And hopefully, the rate
of silent corruption is smaller still.
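
A rate like that can be measured by sampling the EDAC counters twice. The
following is only a sketch: it assumes the per-controller ce_count and
ue_count files under /sys/devices/system/edac/mc/, and the function names
and the 60 second interval are my own choices for illustration.

```python
import glob
import time
from pathlib import Path

# Assumed sysfs layout: one directory per memory controller (mc0, mc1, ...).
EDAC_MC_GLOB = "/sys/devices/system/edac/mc/mc*"


def read_counts(pattern=EDAC_MC_GLOB):
    """Return {controller: (ce_count, ue_count)} for each readable controller."""
    counts = {}
    for mc_dir in sorted(glob.glob(pattern)):
        mc = Path(mc_dir)
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def ce_rates(before, after, seconds):
    """Corrected errors per second for controllers present in both samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        name: (after[name][0] - before[name][0]) / seconds
        for name in after
        if name in before
    }


if __name__ == "__main__":
    interval = 60.0  # illustrative sampling interval in seconds
    first = read_counts()
    time.sleep(interval)
    second = read_counts()
    for name, rate in ce_rates(first, second, interval).items():
        print(f"{name}: {rate:.3f} CE/s, UE total {second[name][1]}")
```

The counters restart when the module is reloaded or the node reboots, so a
negative rate just means the counter was reset between the two samples.
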
> 
> you can set CEs to log through kernel->syslog via edac tunables in /sys.
> 
> >Yes, but the memory of any process might get corrupted, thus this is more to
> 
> if UE is set to panic, nothing will get corrupted (that's really the point eh?)

Correct, but it helps rule out other reasons for job failures.

Cheers,
Henning