[Beowulf] Re: RAM ECC errors (Henning Fehrmann)

Wed Feb 24 07:36:17 PST 2010

> Strangely enough, panic_on_ue is off by default.

this seems to be version-dependent: we have a bunch of HP XC clusters
that have panic_on_ue (and log_ce) enabled by default.  I didn't check
the sources to see whether HP had patched this, though.

> On some apparently broken hardware we have a rate of nearly one event
> per second. I assume the probability of having uncorrectable errors is

it's certainly possible to have periodic CEs (some page that gets accessed by
a periodic timer, etc).  but more likely, this is just the edac module's 
polling rate (/sys/devices/system/edac/mc/poll_msec = 1000, right?)

my experience is that if you're getting CE logs at 1 Hz, then your
actual CE rate is potentially a lot higher: there's a ce_noinfo_count
counter which indicates when there are too many CEs per poll.  if you
really wanted to find the rate, you could poll more often (lower
poll_msec), but my experience is that >1 Hz probably calls for a
physical fix.
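
A rough way to get at the real rate without touching the poll interval
is to sample the counters twice and divide by the elapsed time. The
sketch below assumes the usual EDAC sysfs layout (mcN/ce_count and
mcN/ce_noinfo_count under /sys/devices/system/edac/mc); the function
names and the idea of summing over controllers are mine, not anything
the edac module provides.

```python
import glob
import os

EDAC_MC = "/sys/devices/system/edac/mc"  # usual EDAC sysfs root


def read_counts(root=EDAC_MC):
    """Sum ce_count and ce_noinfo_count over all memory controllers."""
    totals = {"ce_count": 0, "ce_noinfo_count": 0}
    for mc in glob.glob(os.path.join(root, "mc[0-9]*")):
        for name in totals:
            try:
                with open(os.path.join(mc, name)) as f:
                    totals[name] += int(f.read().strip())
            except (OSError, ValueError):
                pass  # attribute missing or unreadable on this kernel
    return totals


def ce_rate(before, after, seconds):
    """Corrected errors per second between two samples."""
    return (after["ce_count"] - before["ce_count"]) / seconds
```

To use it, call read_counts(), sleep for a window much longer than
poll_msec (a minute, say), call it again, and pass both samples plus
the window length to ce_rate(). With a polling driver the counters
only move when the driver polls, so a short window mostly measures
the poll interval.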

OTOH, I do observe machines where a reboot seems to make the CEs go away.
that's worrisome.  on other machines, reseating dimms does the trick
(also a bit worrisome, or at least annoying.)

I have a script that I run to summarize this stuff and decode the 
channel/row to dimm numbers.  for a cluster, I normally run "pdsh -a
collect_edac_stats | sort" to look for problem nodes...
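
For anyone who wants to roll their own, here is a minimal sketch of
what such a collector could look like. It is not the script mentioned
above: it assumes the csrow-style sysfs files
(mcN/csrowM/chK_ce_count), and DIMM_MAP is a made-up placeholder,
since the channel/row to DIMM mapping depends on the board.

```python
import glob
import os
import re

EDAC_MC = "/sys/devices/system/edac/mc"  # usual EDAC sysfs root

# Hypothetical board-specific map: (mc, csrow, channel) -> slot label.
DIMM_MAP = {(0, 0, 0): "DIMM1A", (0, 0, 1): "DIMM1B"}


def collect(root=EDAC_MC, dimm_map=DIMM_MAP):
    """Return sorted lines, one per csrow/channel with a nonzero CE count."""
    lines = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in glob.glob(pattern):
        m = re.search(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$", path)
        if not m:
            continue
        try:
            with open(path) as f:
                count = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed counters
        if count == 0:
            continue  # only report channels that have logged errors
        key = tuple(int(g) for g in m.groups())
        label = dimm_map.get(key, "unknown")
        lines.append("mc%d csrow%d ch%d %s ce=%d" % (key + (label, count)))
    return sorted(lines)
```

Printing the result of collect() one line per entry, from a script
installed on every node, gives output that pdsh will prefix with the
node name, so piping through sort groups the problem nodes together.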