[Beowulf] Re: RAM ECC errors

Tue Feb 23 23:20:43 PST 2010

Hi David,

Thank you for the response.

> Carsten Aulbert  wrote
> > > Are you saying that now that you are monitoring you are seeing kernel
> > > panics which did not appear before?
> > > 
> > 
> > No, but there seem to be a switch in the kernel module that allows to
> > trigger a kernel panic upon discovering uncorrectable errors.
> 
> By "switch" do you mean:
> A. There is an option that may be set when that module is loaded which
> will then cause it to panic on an uncorrectable error, where normally it
> would not.
> B. There has been a change in the module code between kernel versions
> that causes it to panic now on events where it formerly did not panic.

It is A. The edac_core module has a parameter, edac_mc_panic_on_ue;
setting it to 1 makes the kernel panic on an uncorrectable error. We
have not tested it yet, since uncorrectable errors rarely occur.
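
To check what a node is currently set to, something like the following
might do. It is only a rough sketch, not something we run here: it
assumes the usual sysfs layout (the parameter under
/sys/module/edac_core/parameters/ and per-controller counters under
/sys/devices/system/edac/mc/mc*/ue_count), and the function names are
made up for illustration.

```python
from pathlib import Path


def read_panic_on_ue(sysfs_root="/sys"):
    """Return edac_mc_panic_on_ue as an int, or None if it cannot be read
    (for example because edac_core is not loaded)."""
    # Assumed location of the module parameter in sysfs.
    param = Path(sysfs_root) / "module/edac_core/parameters/edac_mc_panic_on_ue"
    try:
        return int(param.read_text().strip())
    except (OSError, ValueError):
        return None


def read_ue_counts(sysfs_root="/sys"):
    """Return {controller name: uncorrectable error count} for every
    memory controller that exposes a readable ue_count file."""
    counts = {}
    # Assumed location of the EDAC memory controller directories.
    mc_dir = Path(sysfs_root) / "devices/system/edac/mc"
    if not mc_dir.is_dir():
        return counts
    for ue_file in sorted(mc_dir.glob("mc*/ue_count")):
        try:
            counts[ue_file.parent.name] = int(ue_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers we cannot read
    return counts


if __name__ == "__main__":
    print("edac_mc_panic_on_ue:", read_panic_on_ue())
    print("UE counts per controller:", read_ue_counts())
```

The sysfs root is a parameter only so that the functions can be tried
against a copy of the directory tree instead of a live node.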

> 
> > > You can get some information through netconsole, but you know that
> > > already.
> > > 
> > 
> > Yup already running, question is if a kernel panic would also be fully
> > visible via netconsole - we are glad that we rarely have those ;)
> 
> I have seen one kernel panic since turning on netconsole, and it did log
> across the network and showed up in /var/log/messages as it was supposed
> to, with the same information presented as in the tests.  Limited data,
> but it would seem the answer is "at least sometimes".

I got a hint from one of the kernel developers. Adding a call to
show_state() in panic.c, right before dump_stack(), should print process
information via printk, which netconsole could then collect.
We are still waiting for a UE event.

> 
> > Yes, but the memory of any process might get corrupted, thus this is
> > more to learn which user is currently running jobs. Which in turn
> > enables us to notify these users that this particular machine running
> > these jobs had a problem and the user might need to re-run her jobs to
> > prevent "false" data entering her job.
> 
> If the node blows up presumably the output of all the jobs currently
> running there will clearly indicate that there was a failure - so you
> should not have to notify those users since they will see the problem in
> their results.  (Unless MPI, or PVM, or whatever is being used to spread
> jobs around, ignores fatal errors, which should never be the case.)  For
> jobs which completed earlier on the same node, this would have been
> before an uncorrectable error took place, so the results should be OK.  

Yes, this is correct. A panic should be enough to avoid corrupted data.
However, jobs often fail for other reasons, and a process list might
help us rule those out. It makes the work a bit more convenient.
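
For comparison, a process list per user can also be collected from user
space while the node is still healthy. The sketch below is not the
show_state() approach described above, and it cannot run at panic time;
it assumes a periodic snapshot (for instance from cron) would be good
enough, and that the kernel provides /proc/<pid>/comm. The function name
is made up for illustration.

```python
import os
import pwd
from collections import defaultdict


def processes_by_user(proc_root="/proc"):
    """Return {user name: [(pid, command name), ...]} for all processes
    visible under proc_root."""
    result = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process directories
        pid_dir = os.path.join(proc_root, entry)
        try:
            # The owner of the /proc/<pid> directory is the process owner.
            uid = os.stat(pid_dir).st_uid
            with open(os.path.join(pid_dir, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the process exited while we were looking at it
        try:
            user = pwd.getpwuid(uid).pw_name
        except KeyError:
            user = str(uid)  # no passwd entry; fall back to the numeric uid
        result[user].append((int(entry), name))
    return dict(result)


if __name__ == "__main__":
    for user, procs in sorted(processes_by_user().items()):
        print(user, len(procs), "processes")
```

Logging this output to a remote host at intervals would at least show
which users had jobs on the node shortly before it went down.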

Cheers,
Henning