[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
Carsten Aulbert carsten.aulbert at aei.mpg.de
Mon Feb 22 22:33:38 PST 2010
Hi David,

replying also on Henning's behalf.

On Monday 22 February 2010 21:30:38 David Mathog wrote:
> Are you saying that now that you are monitoring you are seeing kernel
> panics which did not appear before?

No, but there seems to be a switch in the kernel module that allows triggering a kernel panic when an uncorrectable error is discovered.

> You can get some information through netconsole, but you know that already.

Yup, already running. The question is whether a kernel panic would also be fully visible via netconsole; we are glad that we rarely have those ;)

> Well, you could log process start/stops and flush them to disk or syslog
> them, so that at least when the system crashes it would be possible to
> derive a list of everything that was still running. Doubt this will
> help much though, since the most likely culprit is a bad stick of
> memory, in which case the netconsole or IPMI or MCE messages may be
> enough to figure out which stick is the problem. That is, whichever
> process triggered it is probably an innocent bystander.

Yes, but the memory of any process might get corrupted, so this is more about learning which users are currently running jobs. That in turn enables us to notify these users that the particular machine running their jobs had a problem, and that the user might need to re-run her jobs to prevent "false" data entering her results.

Cheers

Carsten
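
The idea discussed above, periodically recording which users have jobs on a node so they can be notified after a crash, could be sketched roughly as follows. This is only an illustration and not something from the thread: the function names, the "jobwatch" syslog tag, the "active-users" line format and the min_uid cutoff of 1000 are all assumptions. It also assumes a Linux /proc that provides /proc/<pid>/comm, and treats the owner of /proc/<pid> as the user running the process, which is normally but not always the case.

```python
import os
import pwd
import syslog
from collections import defaultdict


def snapshot_processes(proc_root="/proc"):
    """Return a list of (pid, uid, command) for each process under proc_root."""
    records = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories describe processes
        path = os.path.join(proc_root, entry)
        try:
            # Assumption: the directory owner is the user running the process.
            uid = os.stat(path).st_uid
            with open(os.path.join(path, "comm")) as fh:
                command = fh.read().strip()
        except OSError:
            continue  # the process exited while we were looking at it
        records.append((int(entry), uid, command))
    return records


def group_by_user(records, min_uid=1000):
    """Map uid -> sorted command names, skipping accounts below min_uid."""
    by_user = defaultdict(set)
    for _pid, uid, command in records:
        if uid >= min_uid:  # hypothetical cutoff separating system accounts
            by_user[uid].add(command)
    return {uid: sorted(cmds) for uid, cmds in by_user.items()}


def user_name(uid):
    """Resolve a uid to a login name, falling back to the number itself."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def format_line(by_user, names=user_name):
    """Build one compact log line such as 'active-users alice:octave'."""
    parts = ["%s:%s" % (names(uid), ",".join(cmds))
             for uid, cmds in sorted(by_user.items())]
    return "active-users " + " ".join(parts)


def log_snapshot():
    """Write the current per-user snapshot to syslog."""
    syslog.openlog("jobwatch")  # hypothetical tag, chosen for this sketch
    syslog.syslog(syslog.LOG_INFO, format_line(group_by_user(snapshot_processes())))
```

Calling log_snapshot() from cron every minute or so would leave a recent record of active users. For that record to survive a panic, syslog on the node would presumably have to forward to a remote log host rather than write only to the local disk.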
