[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
    Mark Hahn 
    hahn at mcmaster.ca
       
    Wed Feb 24 07:36:17 PST 2010
    
    
  
> Strangely enough, panic_on_ue is off by default.
this seems to be version-dependent: we have a bunch of HP XC clusters
that have panic_on_ue (and log_ce) enabled by default.  I didn't check
the sources to see whether HP had patched this, though.
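
since the default apparently varies, it is easy enough to read what a given
node actually has set.  a minimal sketch, assuming the controls live either
under /sys/devices/system/edac/mc/ or as edac_core module parameters with an
edac_mc_ prefix (both locations are assumptions about your kernel, and
read_edac_setting is just an illustrative helper name):

```python
import os

# Candidate locations; which one exists depends on the kernel (assumption).
CANDIDATES = (
    "/sys/devices/system/edac/mc/{name}",
    "/sys/module/edac_core/parameters/edac_mc_{name}",
)

def read_edac_setting(name, candidates=CANDIDATES):
    """Return the setting as a stripped string, or None if no file is found."""
    for pattern in candidates:
        path = pattern.format(name=name)
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            continue  # not present (or not readable) at this location
    return None

if __name__ == "__main__":
    for name in ("panic_on_ue", "log_ce", "poll_msec"):
        print(name, "=", read_edac_setting(name))
```

a result of None just means neither location exists on that node, for
example because no EDAC driver is loaded.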
> On some apparently broken hardware we have a rate of nearly one event
> per second. I assume the probability of having uncorrectable errors is
it's certainly possible to have periodic CEs (some page that gets accessed by
a periodic timer, etc).  but more likely, this is just the edac module's 
polling rate (/sys/devices/system/edac/mc/poll_msec = 1000, right?)
my experience is that if you're getting CE logs at 1 Hz, then your
actual CE rate is potentially a lot higher: there's a ce_noinfo_count
counter which indicates when there are too many CEs per poll.  if you
really wanted to find the rate, you could crank up the polling rate
(i.e. lower poll_msec), but my experience is that >1 Hz probably calls
for a physical fix.
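
to put a number on it without counting log lines, you can diff the sysfs
counters over an interval.  a rough sketch, assuming each memory controller
shows up as mc0, mc1, ... under /sys/devices/system/edac/mc with ce_count and
ce_noinfo_count files (read_counts and rates are illustrative names).  on
hardware that only latches one error per poll, the result is still just a
lower bound:

```python
import glob
import os
import time

EDAC_MC = "/sys/devices/system/edac/mc"  # path from the post; may differ per kernel

def read_counts(root=EDAC_MC):
    """Sum ce_count and ce_noinfo_count over all memory controllers under root."""
    totals = {"ce_count": 0, "ce_noinfo_count": 0}
    for mc in glob.glob(os.path.join(root, "mc[0-9]*")):
        for name in totals:
            try:
                with open(os.path.join(mc, name)) as f:
                    totals[name] += int(f.read().strip())
            except (OSError, ValueError):
                pass  # counter missing or unreadable on this controller
    return totals

def rates(before, after, seconds):
    """Per-second increase of each counter between two samples."""
    return {k: (after[k] - before[k]) / seconds for k in before}

if __name__ == "__main__":
    interval = 60.0  # long compared to poll_msec, so several polls are covered
    first = read_counts()
    time.sleep(interval)
    print(rates(first, read_counts(), interval))
```

a ce_noinfo_count that keeps climbing alongside ce_count is the hint that
the per-poll numbers are undercounting.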
OTOH, I do observe machines where a reboot seems to make the CEs go away.
that's worrisome.  on other machines, reseating dimms does the trick
(also a bit worrisome, or at least annoying.)
I have a script that I run to summarize this stuff and decode the 
channel/row to dimm numbers.  for a cluster, I normally run
"pdsh -a collect_edac_stats | sort" to look for problem nodes...
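
the script itself is not included here, so the following is only a sketch of
what such a summary might look like, not the original.  it assumes the
per-channel counters are exposed as mcN/csrowM/chK_ce_count with an optional
chK_dimm_label next to them; the output format is made up, and mapping
csrow/channel to a physical slot is board-specific, so the label is only as
good as what the node has been configured with:

```python
import glob
import os

EDAC_MC = "/sys/devices/system/edac/mc"  # path from the post; may differ per kernel

def _read(path):
    """Return stripped file contents, or an empty string if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def collect_edac_stats(root=EDAC_MC):
    """Return one line per csrow channel that has a nonzero CE count."""
    lines = []
    pattern = os.path.join(root, "mc[0-9]*", "csrow[0-9]*", "ch[0-9]*_ce_count")
    for path in sorted(glob.glob(pattern)):
        count = _read(path)
        if not count.isdigit() or int(count) == 0:
            continue  # skip clean or unreadable channels
        csrow_dir, fname = os.path.split(path)
        mc_dir, csrow = os.path.split(csrow_dir)
        mc = os.path.basename(mc_dir)
        ch = fname.split("_")[0]  # "ch0_ce_count" -> "ch0"
        label = _read(os.path.join(csrow_dir, ch + "_dimm_label")) or "unlabeled"
        lines.append("%s %s %s %s ce=%s" % (mc, csrow, ch, label, count))
    return lines

if __name__ == "__main__":
    for line in collect_edac_stats():
        print(line)
```

installed as an executable on every node, this produces no output on clean
machines; pdsh prefixes each line with the hostname, so the sorted output
groups the problem nodes together.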
    
    