[Beowulf] Re: RAM ECC errors
henning.fehrmann at aei.mpg.de
Tue Feb 23 23:20:43 PST 2010
Thank you for the response.
> Carsten Aulbert wrote
> > > Are you saying that now that you are monitoring you are seeing kernel
> > > panics which did not appear before?
> > >
> > No, but there seems to be a switch in the kernel module that allows it to
> > trigger a kernel panic upon discovering uncorrectable errors.
> By "switch" do you mean:
> A. There is an option that may be set when that module is loaded which
> will then cause it to panic on an uncorrectable error, where normally it
> would not.
> B. There has been a change in the module code between kernel versions
> that causes it to panic now on events where it formerly did not panic.
It is A. There is a module parameter for edac_core:
edac_mc_panic_on_ue=1. We have not tested it yet since uncorrectable
errors rarely occur.
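
For anyone who wants to check the setting on a running node, here is a
minimal sketch. It assumes the parameter is exposed under
/sys/module/edac_core/parameters/ and that the per-controller counters
live under /sys/devices/system/edac/mc/mcN/ (ce_count, ue_count); these
paths may differ on your kernel. The function names are made up for this
example.

```python
from pathlib import Path

# Assumed sysfs locations; adjust them if your kernel lays things out differently.
PARAM = Path("/sys/module/edac_core/parameters/edac_mc_panic_on_ue")
MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def edac_status(param=PARAM, mc_root=MC_ROOT):
    """Collect the panic-on-UE setting and the per-controller error counts."""
    status = {"panic_on_ue": read_int(param), "controllers": {}}
    mc_root = Path(mc_root)
    if mc_root.is_dir():
        # Memory controllers appear as mc0, mc1, ... below mc_root.
        for mc in sorted(mc_root.glob("mc[0-9]*")):
            status["controllers"][mc.name] = {
                "ce_count": read_int(mc / "ce_count"),
                "ue_count": read_int(mc / "ue_count"),
            }
    return status
```

Calling edac_status() returns a dictionary; a panic_on_ue value of None
means the parameter file could not be read, for example because the
module is not loaded.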
> > > You can get some information through netconsole, but you know that
> > >
> > Yup already running, question is if a kernel panic would also be fully
> > logged via netconsole - we are glad that we rarely have those ;)
> I have seen one kernel panic since turning on netconsole, and it did log
> across the network and showed up in /var/log/messages as it was supposed
> to, with the same information presented as in the tests. Limited data,
> but it would seem the answer is "at least sometimes".
I got a hint from one of the kernel developers. Calling the show_state()
function in panic.c right before dump_stack() should print process
information via printk, which could then be collected with netconsole.
We are still waiting for a UE event.
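
Since netconsole sends the kernel messages as plain UDP datagrams, a very
small receiver is enough to capture them on the log host. The following
is only a sketch: port 6666 is netconsole's default target port, but the
port actually used depends on how netconsole was configured on the nodes,
and the function names are hypothetical.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed to be the netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect(sock, max_packets, timeout=None):
    """Receive up to max_packets datagrams and return (sender_ip, text) pairs.

    Stops early if no datagram arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    messages = []
    for _ in range(max_packets):
        try:
            data, (addr, _port) = sock.recvfrom(65535)
        except socket.timeout:
            break
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        messages.append((addr, text))
    return messages
```

Calling collect() in a loop on the socket returned by open_listener() and
writing each pair to a file gives a per-node record of whatever the
kernel managed to send out before it stopped. UDP is unreliable, so
output from a panic may still arrive incomplete.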
> > Yes, but the memory of any process might get corrupted, thus this is
> > more to learn which user is currently running jobs. Which in turn
> > enables us to notify these users that this particular machine running
> > these jobs had a problem and the user might need to re-run her jobs to
> > prevent "false" data entering her job.
> If the node blows up presumably the output of all the jobs currently
> running there will clearly indicate that there was a failure - so you
> should not have to notify those users since they will see the problem in
> their results. (Unless MPI, or PVM, or whatever is being used to spread
> jobs around, ignores fatal errors, which should never be the case.) For
> jobs which completed earlier on the same node, this would have been
> before an uncorrectable error took place, so the results should be OK.
Yes, this is correct. A panic should be enough to avoid corrupted data.
However, jobs often fail for other reasons, and a process list might help
us to rule out those other causes of job failure. It makes the work a bit