[Beowulf] Re: RAM ECC errors

David Mathog mathog at caltech.edu
Tue Feb 23 09:05:30 PST 2010


Carsten Aulbert wrote:
> > Are you saying that now that you are monitoring you are seeing kernel
> > panics which did not appear before?
> > 
> 
> No, but there seems to be a switch in the kernel module that makes it
> trigger a kernel panic upon discovering uncorrectable errors.

By "switch" do you mean:
A. There is an option that may be set when that module is loaded which
will then cause it to panic on an uncorrectable error, where normally it
would not.
B. There has been a change in the module code between kernel versions
that causes it to panic now on events where it formerly did not panic.
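
If it is case A, the setting can usually be read back on a running node.
A minimal sketch, assuming the module in question is the EDAC core
(edac_core) and the switch is its edac_mc_panic_on_ue parameter. Neither
is confirmed in this thread, and the sysfs path may differ between kernel
versions. The function name and the sysfs_root argument are made up for
illustration.

```python
from pathlib import Path

# Assumed location of the EDAC "panic on uncorrectable error" switch,
# relative to the sysfs mount point. Not confirmed by this thread.
PARAM = "module/edac_core/parameters/edac_mc_panic_on_ue"


def panic_on_ue(sysfs_root="/sys"):
    """Return True or False for the switch, or None if it cannot be read."""
    path = Path(sysfs_root) / PARAM
    try:
        text = path.read_text().strip()
    except OSError:
        return None  # module not loaded, or parameter not present
    # Anything other than an explicit "off" value counts as enabled.
    return text not in ("0", "N", "n", "")
```

Calling panic_on_ue() with no arguments reads the live /sys tree. A None
result on a node that does report ECC errors would suggest that the
assumed module or parameter name is wrong for that kernel.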

> > You can get some information through netconsole, but you know that
> > already.
> > 
> 
> Yup already running, question is if a kernel panic would also be fully
> visible via netconsole - we are glad that we rarely have those ;)

I have seen one kernel panic since turning on netconsole, and it did log
across the network and showed up in /var/log/messages as it was supposed
to, with the same information presented as in the tests.  Limited data,
but it would seem the answer is "at least sometimes".
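
For testing this without waiting for a real panic, a throwaway receiver
can help. The sketch below is not what produced the /var/log/messages
entries mentioned above (those came in through syslog). It only relies
on netconsole sending each kernel message as a plain-text UDP datagram,
and it assumes the sender targets port 6666, which should be checked
against the actual netconsole configuration. The function names are
hypothetical.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on the port netconsole is assumed to target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_line(sock, timeout=None):
    """Wait for one datagram and return (sender_ip, message_text)."""
    sock.settimeout(timeout)  # None means block until something arrives
    data, (sender_ip, _sender_port) = sock.recvfrom(65535)
    # Kernel messages are text; replace any undecodable bytes.
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return sender_ip, text
```

Calling receive_line(open_listener()) in a loop and printing each result
with the sender address gives a per-node view of what actually made it
onto the wire before a node went down.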

> Yes, but the memory of any process might get corrupted, thus this is more
> to learn which user is currently running jobs. Which in turn enables us to
> notify these users that this particular machine running these jobs had a
> problem and the user might need to re-run her jobs to prevent "false" data
> entering her job.

If the node blows up presumably the output of all the jobs currently
running there will clearly indicate that there was a failure - so you
should not have to notify those users since they will see the problem in
their results.  (Unless MPI, or PVM, or whatever is being used to spread
jobs around, ignores fatal errors, which should never be the case.)  For
jobs which completed earlier on the same node, this would have been
before an uncorrectable error took place, so the results should be OK.  
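
The "completed before the error" assumption could be checked rather than
taken on faith by snapshotting the error counters at the start and end
of each job. A rough sketch, assuming the EDAC driver is loaded and
exposes per-controller ce_count and ue_count files under
/sys/devices/system/edac/mc/. The function names and the sysfs_root
argument are invented for illustration, and how the snapshots get tied
to a job (prologue/epilogue scripts, for instance) is left open.

```python
from pathlib import Path


def edac_counts(sysfs_root="/sys"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory."""
    base = Path(sysfs_root) / "devices/system/edac/mc"
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text()),
                "ue": int((mc / "ue_count").read_text()),
            }
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
    return counts


def new_uncorrectable(before, after):
    """True if any controller's UE count rose between two snapshots."""
    return any(
        after[mc]["ue"] > before.get(mc, {"ue": 0})["ue"] for mc in after
    )
```

A job whose end-of-run snapshot shows no increase in ue_count finished
without the driver having logged an uncorrectable error during its run.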

Or am I missing something?

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


