[Beowulf] Re: RAM ECC errors (Henning Fehrmann)

David Mathog mathog at caltech.edu
Mon Feb 22 12:30:38 PST 2010


 Henning Fehrmann <henning.fehrmann at aei.mpg.de> wrote:
> we started monitoring the rate of correctable errors appearing in the RAM.
> We also observed a few uncorrectable errors. The corresponding kernel
> module 'edac_core' can cause a kernel panic when such an event occurs,
> which makes sense to avoid corrupted results.

Are you saying that, now that you are monitoring, you are seeing kernel
panics that did not occur before?

> 
> Is there a way to get some useful information before the kernel panics?

You can get some information through netconsole, but you know that already.

> In particular, we are looking for the process list, to find out which
> user was running what before the UE errors occurred.

Well, you could log process starts and stops and either flush them to
disk or send them to syslog, so that after a crash it would at least be
possible to derive a list of everything that was still running.  I doubt
this will help much, though, since the most likely culprit is a bad
stick of memory, in which case the netconsole, IPMI, or MCE messages may
be enough to figure out which stick is the problem.  That is, whichever
process triggered the error is probably an innocent bystander.
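
A rough sketch of the logging idea, in Python, in case it is useful.  It
polls /proc at an interval instead of hooking real process start/stop
events, so it is a simpler stand-in for what is described above and will
miss short-lived jobs.  The function names, the log path, and the
60-second interval are all made up for illustration; only the fsync()
after each snapshot matters, since that is what gets the list onto disk
before a panic.

```python
import os
import syslog
import time


def snapshot_processes(proc_root="/proc"):
    """Return a sorted list of (pid, uid, command line) seen under proc_root."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        try:
            uid = os.stat(pid_dir).st_uid  # owner of /proc/<pid>
            with open(os.path.join(pid_dir, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited while we were looking at it
        # Arguments in cmdline are NUL-separated; join them with spaces.
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        procs.append((int(entry), uid, cmd or "[no cmdline]"))
    return sorted(procs)


def append_snapshot(log_path, procs, now=None):
    """Append one timestamped line per process and force the data to disk."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    with open(log_path, "a") as log:
        for pid, uid, cmd in procs:
            log.write(f"{stamp} pid={pid} uid={uid} cmd={cmd}\n")
        log.flush()
        os.fsync(log.fileno())  # do not leave the snapshot in the page cache


def log_forever(log_path, interval=60):
    """Hypothetical driver loop: snapshot, sync to disk, note it in syslog."""
    while True:
        procs = snapshot_processes()
        append_snapshot(log_path, procs)
        syslog.syslog(syslog.LOG_INFO, f"proc snapshot: {len(procs)} processes")
        time.sleep(interval)
```

Started from an init script as, say, log_forever("/var/tmp/proc_snapshot.log"),
the last complete snapshot in the file after a reboot shows which users
had what running shortly before the crash.  Forwarding syslog to a remote
host would keep a copy off the failing node as well.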

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


