[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
carsten.aulbert at aei.mpg.de
Mon Feb 22 22:33:38 PST 2010
replying also on Henning's behalf
On Monday 22 February 2010 21:30:38 David Mathog wrote:
> Are you saying that now that you are monitoring you are seeing kernel
> panics which did not appear before?
No, but there seems to be a switch in the kernel module that allows triggering
a kernel panic upon discovering uncorrectable errors.
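
For reference, a minimal sketch of how one might check that switch and the
error counters from a monitoring script. It assumes the module in question is
the Linux EDAC core and that the switch is its edac_mc_panic_on_ue parameter;
the message above does not name either, so treat both sysfs paths as
assumptions to verify on your own nodes. The function names are made up for
illustration.

```python
import glob
import os

# Assumed locations (Linux EDAC core); adjust if your kernel differs.
PANIC_PARAM = "/sys/module/edac_core/parameters/edac_mc_panic_on_ue"
EDAC_ROOT = "/sys/devices/system/edac/mc"


def panic_on_ue_enabled(param_path=PANIC_PARAM):
    """Return True/False for the panic switch, or None if the file is absent."""
    try:
        with open(param_path) as fh:
            return fh.read().strip() != "0"
    except OSError:
        return None


def error_counts(edac_root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} from the EDAC counters."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc, "ce_count")) as ce_file:
                corrected = int(ce_file.read())
            with open(os.path.join(mc, "ue_count")) as ue_file:
                uncorrected = int(ue_file.read())
        except (OSError, ValueError):
            continue  # controller vanished or counter unreadable
        counts[os.path.basename(mc)] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    print("panic on uncorrectable error:", panic_on_ue_enabled())
    for controller, (ce, ue) in error_counts().items():
        print(f"{controller}: corrected={ce} uncorrected={ue}")
```

Both functions take the path as a parameter so they can be pointed at a copy
of the sysfs tree; on a node without the EDAC driver loaded they simply return
None and an empty dict.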
> You can get some information through netconsole, but you know that already.
Yup, already running. The question is whether a kernel panic would also be
fully visible via netconsole - we are glad that we rarely have those ;)
> Well, you could log process start/stops and flush them to disk or syslog
> them, so that at least when the system crashes it would be possible to
> derive a list of everything that was still running. Doubt this will
> help much though, since the most likely culprit is a bad stick of
> memory, in which case the netconsole or IPMI or MCE messages may be
> enough to figure out which stick is the problem. That is, whichever
> process triggered it is probably an innocent bystander.
Yes, but the memory of any process might get corrupted, so this is more about
learning which users are currently running jobs. That in turn lets us notify
these users that the particular machine running their jobs had a problem and
that a user might need to re-run her jobs to prevent "false" data entering her
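
A rough sketch of the "which user is running jobs" part, under some
assumptions: it takes a periodic snapshot of /proc rather than logging every
process start/stop as David suggested, it treats the owner of each /proc/<pid>
directory as the job's user, and it relies on syslog being forwarded off the
node so the last snapshot survives a crash. The function names are
hypothetical.

```python
import os
import syslog
from collections import defaultdict


def snapshot_jobs(proc_root="/proc"):
    """Map each UID to the (pid, command name) pairs it currently owns."""
    jobs = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        pid_dir = os.path.join(proc_root, entry)
        try:
            uid = os.stat(pid_dir).st_uid
            with open(os.path.join(pid_dir, "comm")) as fh:
                name = fh.read().strip()
        except OSError:
            continue  # process exited while we were looking
        jobs[uid].append((int(entry), name))
    return dict(jobs)


def log_snapshot(jobs):
    """Write one syslog line per user listing that user's processes."""
    for uid, procs in sorted(jobs.items()):
        names = ",".join(f"{name}[{pid}]" for pid, name in sorted(procs))
        syslog.syslog(syslog.LOG_INFO, f"jobs uid={uid}: {names}")


if __name__ == "__main__":
    log_snapshot(snapshot_jobs())
```

Run from cron every minute or so, this would leave a recent list of users per
node in the central log; filtering out system UIDs is left out here.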