[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
Mark Hahn hahn at mcmaster.ca
Wed Feb 24 07:36:17 PST 2010
- Previous message: [Beowulf] Re: RAM ECC errors (Henning Fehrmann)
- Next message: [Beowulf] Computational / GPGPU Engineer at Life Technologies
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> Strangely enough, panic_on_ue is off by default.

this seems to be version-dependent (we have a bunch of HP XC clusters that have panic_on_ue (and log_ce) enabled by default). I didn't check the sources to see whether HP had patched this, though.

> On some apparently broken hardware we have a rate of nearly one event
> per second. I assume the probability of having uncorrectable errors is

it's certainly possible to have periodic CEs (some page that gets accessed by a periodic timer, etc). but more likely, this is just the edac module's polling rate (/sys/devices/system/edac/mc/poll_msec = 1000, right?)

my experience is that if you're getting CE logs at 1 Hz, then your actual CE rate is potentially a lot higher: there's a ce_noinfo_count control which indicates when there are too many CEs per poll. if you really wanted to find the rate, you could crank up the polling rate (that is, lower poll_msec), but my experience is that >1 Hz probably calls for a physical fix.

OTOH, I do observe machines where a reboot seems to make the CEs go away. that's worrisome. on other machines, reseating dimms does the trick (also a bit worrisome, or at least annoying.)

I have a script that I run to summarize this stuff and decode the channel/row to dimm numbers. for a cluster, I normally run "pdsh -a collect_edac_stats | sort" to look for problem nodes...
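For readers who want something to start from, here is a minimal sketch of what a per-node summary script along these lines might look like. It is not the script mentioned in the message, only an illustration that borrows its name. It assumes the legacy EDAC sysfs layout (mcN/ce_count, mcN/ue_count, mcN/ce_noinfo_count, mcN/csrowN/chM_ce_count and chM_dimm_label); newer kernels may expose per-DIMM directories instead, and the DIMM labels are only meaningful if something has populated them. The function names and the output format are hypothetical.

```python
import glob
import os

# Assumed location of the EDAC memory-controller tree (legacy layout).
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_text(path):
    """Return the stripped contents of a sysfs file, or "" if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def read_int(path):
    """Return a counter as an int, treating missing or odd values as 0."""
    text = read_text(path)
    return int(text) if text.isdigit() else 0


def collect_edac_stats(root=EDAC_ROOT):
    """Return one summary line per controller or channel with errors."""
    lines = []
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        mc = os.path.basename(mc_dir)
        ce = read_int(os.path.join(mc_dir, "ce_count"))
        ue = read_int(os.path.join(mc_dir, "ue_count"))
        noinfo = read_int(os.path.join(mc_dir, "ce_noinfo_count"))
        if ce == 0 and ue == 0:
            continue  # quiet controller, nothing to report
        lines.append(f"{mc} ce={ce} ue={ue} ce_noinfo={noinfo}")

        # Per-channel counters let us point at a specific DIMM slot.
        pattern = os.path.join(mc_dir, "csrow[0-9]*", "ch[0-9]*_ce_count")
        for ce_file in sorted(glob.glob(pattern)):
            count = read_int(ce_file)
            if count == 0:
                continue
            row_dir = os.path.dirname(ce_file)
            channel = os.path.basename(ce_file).split("_")[0]
            label_path = os.path.join(row_dir, channel + "_dimm_label")
            label = read_text(label_path) or "unlabeled"
            row = os.path.basename(row_dir)
            lines.append(f"{mc} {row} {channel} {label} ce={count}")
    return lines


if __name__ == "__main__":
    for line in collect_edac_stats():
        print(line)
```

Since pdsh prefixes each output line with the node name, running a script like this across the cluster and piping through sort groups the lines by node, and healthy nodes print nothing at all.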
