[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
Mark Hahn
hahn at mcmaster.ca
Wed Feb 24 07:36:17 PST 2010
> Strangely enough, panic_on_ue is off by default.
this seems to be version-dependent: we have a bunch of HP XC clusters
that have panic_on_ue (and log_ce) enabled by default. I didn't check
the sources to see whether HP had patched this, though.
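To see what a given node is actually set to, something like the following minimal sketch could be used. It assumes the controls sit in the same directory as the poll_msec path mentioned in this thread; some kernels may expose them elsewhere (for example as module parameters), so a missing file is reported rather than treated as an error. The function name read_edac_settings is made up for illustration.

```python
from pathlib import Path

# Assumed location, taken from the poll_msec path quoted in this thread.
# Other kernel versions may expose these controls somewhere else.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")


def read_edac_settings(base=EDAC_MC_DIR,
                       names=("panic_on_ue", "log_ce", "poll_msec")):
    """Return {name: value or None} for each control file under base."""
    settings = {}
    for name in names:
        path = Path(base) / name
        try:
            settings[name] = path.read_text().strip()
        except OSError:
            # File absent or unreadable on this kernel.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_edac_settings().items():
        shown = value if value is not None else "(not present)"
        print(f"{name} = {shown}")
```

Run across a cluster, the output makes it easy to spot nodes whose defaults differ.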
> On some apparently broken hardware we have a rate of nearly one event
> per second. I assume the probability of having uncorrectable errors is
it's certainly possible to have periodic CEs (some page that gets accessed by
a periodic timer, etc). but more likely, this is just the edac module's
polling rate (/sys/devices/system/edac/mc/poll_msec = 1000, right?)
my experience is that if you're getting CE logs at 1 Hz, then your
actual CE rate is potentially a lot higher: there's a ce_noinfo_count
counter which indicates when there are too many CEs per poll. if you
really wanted to find the rate, you could crank up the polling rate
(i.e. lower poll_msec), but my experience is that >1 Hz probably calls
for a physical fix.
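A rough way to estimate the rate without touching poll_msec is to sample the counters twice and divide by the elapsed time. The sketch below assumes each memory controller exposes a ce_count file under mc*/ in the directory named above, and that the driver's counter reflects more than one CE per poll, which depends on the hardware. The function names and the 60-second window are arbitrary choices for illustration.

```python
import time
from pathlib import Path

# Assumed layout: one mc*/ directory per memory controller, each with ce_count.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")


def total_ce_count(base=EDAC_MC_DIR):
    """Sum ce_count over every mc* controller directory under base."""
    total = 0
    for path in Path(base).glob("mc*/ce_count"):
        total += int(path.read_text().strip())
    return total


def ce_rate(count_before, count_after, seconds):
    """Average CEs per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_after - count_before) / seconds


if __name__ == "__main__":
    interval = 60.0  # hypothetical sampling window, in seconds
    before = total_ce_count()
    time.sleep(interval)
    after = total_ce_count()
    print(f"average CE rate: {ce_rate(before, after, interval):.2f}/s")
```

If the result comes out well above one per second, that is consistent with the logs undercounting.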
OTOH, I do observe machines where a reboot seems to make the CEs go away.
that's worrisome. on other machines, reseating dimms does the trick
(also a bit worrisome, or at least annoying.)
I have a script that I run to summarize this stuff and decode the
channel/row to dimm numbers. for a cluster, I normally run "pdsh -a
collect_edac_stats | sort" to look for problem nodes...
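The script itself is not shown here. Purely as an illustration of the idea, a minimal per-node collector might look like the sketch below. It assumes the sysfs layout mc*/csrow*/ce_count and mc*/csrow*/ue_count, and it does not attempt the csrow/channel-to-DIMM decoding, since that mapping is board-specific. The name collect_counts is hypothetical.

```python
from pathlib import Path

# Assumed layout: mc*/csrow*/{ce_count,ue_count} under this directory.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")


def collect_counts(base=EDAC_MC_DIR):
    """Return (mc, csrow, ce, ue) tuples, in path order, for nonzero counts."""
    rows = []
    for csrow in sorted(Path(base).glob("mc*/csrow*")):
        try:
            ce = int((csrow / "ce_count").read_text().strip())
            ue = int((csrow / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip rows that are missing or unparsable
        if ce or ue:
            rows.append((csrow.parent.name, csrow.name, ce, ue))
    return rows


if __name__ == "__main__":
    # pdsh prefixes each output line with the node name, so none is added here.
    for mc, csrow, ce, ue in collect_counts():
        print(f"{mc} {csrow} ce={ce} ue={ue}")
```

Printing only nonzero rows keeps the sorted cluster-wide output short enough to scan by eye.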