[Beowulf] Re: RAM ECC errors (Henning Fehrmann)


Mark Hahn hahn at mcmaster.ca
Wed Feb 24 07:36:17 PST 2010


> Strangely enough, panic_on_ue is off by default.

this seems to be version-dependent: we have a bunch of HP XC clusters
that have panic_on_ue (and log_ce) enabled by default.  I didn't check
the sources to see whether HP had patched this, though.

> On some apparently broken hardware we have a rate of nearly one event
> per second. I assume the probability of having uncorrectable errors is

it's certainly possible to have periodic CEs (some page that gets accessed by
a periodic timer, etc).  but more likely, this is just the edac module's 
polling rate (/sys/devices/system/edac/mc/poll_msec = 1000, right?)
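
A minimal sketch for checking these knobs on a node, assuming the control
files live under the sysfs path quoted above (other kernel versions may put
them elsewhere). The function name read_edac_controls is made up for this
example; anything it cannot read is reported as None.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller controls, taken from the
# path quoted above; other kernels may expose them somewhere else.
EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_edac_controls(base=EDAC_MC,
                       names=("panic_on_ue", "log_ce", "poll_msec")):
    """Return {name: int or None} for each control file under base."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_edac_controls().items():
        shown = value if value is not None else "unavailable"
        print(f"{name} = {shown}")
```

With poll_msec = 1000 the driver looks at the hardware once per second, so
logs arriving at exactly 1 Hz match the polling interval.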

my experience is that if you're getting CE logs at 1 Hz, then your
actual CE rate is potentially a lot higher: there's a ce_noinfo_count
control which indicates when there are too many CEs per poll.  if you
really wanted to find the rate, you could crank up the polling rate
(i.e. lower poll_msec), but my experience is that >1 Hz probably calls
for a physical fix.
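
One rough way to put a number on the rate is to sample the counters twice
and divide by the elapsed time. The sketch below assumes per-controller
counters at mc*/ce_count under the same sysfs directory; the function names
are invented for illustration. Since the counters only advance when the
driver polls, treat the result as a lower bound.

```python
import time
from pathlib import Path

# Assumed EDAC sysfs directory (same path as quoted above).
EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_ce_count(base=EDAC_MC):
    """Sum ce_count over all memory controllers found under base."""
    total = 0
    for counter in Path(base).glob("mc*/ce_count"):
        try:
            total += int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # skip counters that are unreadable or malformed
    return total


def ce_rate(before, after, seconds):
    """Correctable errors per second between two counter samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # A negative difference means the counters were reset in between.
    return max(after - before, 0) / seconds


def measure(interval=60.0, base=EDAC_MC):
    """Sample the counters, wait interval seconds, sample again."""
    first = read_ce_count(base)
    time.sleep(interval)
    second = read_ce_count(base)
    return ce_rate(first, second, interval)
```

Calling measure(60) on a suspect node gives CEs per second averaged over a
minute; anything near or above 1.0 is in the "physical fix" range.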

OTOH, I do observe machines where a reboot seems to make the CEs go away.
that's worrisome.  on other machines, reseating dimms does the trick
(also a bit worrisome, or at least annoying.)

I have a script that I run to summarize this stuff and decode the 
channel/row to dimm numbers.  for a cluster, I normally run "pdsh -a
collect_edac_stats | sort" to look for problem nodes...
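
For anyone without such a script, here is a hypothetical stand-in (not the
collect_edac_stats script mentioned above, which is not shown in this
post). It assumes the per-csrow sysfs layout with chN_ce_count and
chN_dimm_label files, and prints one line per nonzero counter so the output
sorts usefully once pdsh has prefixed the node name.

```python
import re
from pathlib import Path

# Assumed EDAC sysfs directory (same path as quoted above).
EDAC_MC = Path("/sys/devices/system/edac/mc")


def collect_ce(base=EDAC_MC):
    """Return sorted (mc, csrow, channel, count, label) for nonzero CEs."""
    rows = []
    for counter in Path(base).glob("mc*/csrow*/ch*_ce_count"):
        match = re.fullmatch(r"ch(\d+)_ce_count", counter.name)
        if not match:
            continue
        try:
            count = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue
        if count == 0:
            continue
        # The DIMM label file is optional; fall back to a placeholder.
        label_file = counter.with_name(f"ch{match.group(1)}_dimm_label")
        try:
            label = label_file.read_text().strip()
        except OSError:
            label = ""
        rows.append((counter.parent.parent.name, counter.parent.name,
                     int(match.group(1)), count, label or "unlabeled"))
    return sorted(rows)


def format_rows(rows):
    """Render one text line per nonzero counter."""
    return [f"{mc} {csrow} ch{ch} ce={count} dimm={label}"
            for mc, csrow, ch, count, label in rows]


if __name__ == "__main__":
    for line in format_rows(collect_ce()):
        print(line)
```

Nodes with no correctable errors print nothing, so only problem nodes show
up in the sorted pdsh output.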


