[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
Mark Hahn hahn at mcmaster.ca
Tue Feb 23 12:05:39 PST 2010
> No, but there seem to be a switch in the kernel module that allows to trigger
> a kernel panic upon discovering uncorrectable errors.

I suspect you mean /sys/module/edac_mc/panic_on_ue (UE = uncorrected error). I consider this very much the norm: it would be very strange to run with ECC memory, and ECC enabled, and not actually halt on UE. A UE represents a failure of the memory system: not just a transient event, but something which must be physically fixed. Even for HA situations, I'd be pretty skeptical about using a memory channel which had any UEs on it.

CEs (corrected errors), OTOH, are very different. They're almost just a heartbeat of your ECC subsystem. Yes, a CE indicates some event that needed correcting, but at a modest rate, CEs are acceptable. There are failure modes, though, where enough CEs eventually cause a UE: tracking the CE rate is important for that reason. (Other UE modes don't have this warning sign...) You can set CEs to log through kernel->syslog via EDAC tunables in /sys.

> Yes, but the memory of any process might get corrupted, thus this is more to

If UE is set to panic, nothing will get corrupted (that's really the point, eh?)
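For anyone who wants to track the CE rate rather than just log it, here is a minimal sketch in Python. It assumes the per-controller counters are exposed as `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`, which is where the EDAC driver normally puts them; the exact layout depends on your kernel and driver, so the root directory is a parameter. The function names are made up for illustration and are not part of any EDAC tool.

```python
import glob
import os

# Assumed default location of the EDAC memory-controller counters.
# Adjust if your kernel exposes them elsewhere.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_counts(root=EDAC_MC_ROOT):
    """Return {controller: (ce_count, ue_count)} for each mc* directory under root."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[os.path.basename(mc_dir)] = (ce, ue)
    return counts


def ce_rate_per_hour(before, after, elapsed_seconds):
    """Corrected errors per hour, per controller, between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for mc, (ce_after, _ue_after) in after.items():
        if mc not in before:
            continue
        delta = ce_after - before[mc][0]
        if delta < 0:
            # Counter went backwards (reboot or module reload): count from zero.
            delta = ce_after
        rates[mc] = delta * 3600.0 / elapsed_seconds
    return rates
```

Call `read_counts()` twice some interval apart, pass both samples and the elapsed seconds to `ce_rate_per_hour()`, and alert when a controller's rate climbs. What counts as a "modest" rate is site-specific, so the sketch deliberately leaves the threshold to the caller.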
