[Beowulf] ECC Memory and Job Failures
jclinton at advancedclustering.com
Fri Apr 24 09:03:04 PDT 2009
On Fri, Apr 24, 2009 at 12:49 AM, John Hearns <hearnsj at googlemail.com> wrote:
> 2009/4/23 Nifty Tom Mitchell <niftyompi at niftyegg.com>:
> > On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
> > IMO Running on a large cluster without multiple bit detection and a
> > minimum of one bit correction ECC is silly.
> > Further running without watching the ECC logs is also silly. Watching
> > logs can be hard to do.
> Yes indeed.
> At the risk of being an SGI fanboy again, obviously SGI Altix systems
> keep excellent logs of hardware errors in /var/log/salinfo - indeed we
> had a DIMM fail the day before yesterday, I sent off the traces, and
The EDAC drivers for Linux are able to do this for all x86_64 platforms up
to but not including Nehalem (a driver hasn't been released yet). With EDAC,
a whole slew of statistics is made available in /sys, which can be used for
reporting, tracking and tracing a failing DIMM down to its physical socket. In
fact, just a few weeks ago, AMD released 29 patches for Barcelona and
Shanghai. (Unfortunately, these new patches only build on 2.6.30-rc*.)
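To give a rough idea of what reading those statistics looks like, here is a
minimal sketch in Python. It assumes the csrow-based sysfs layout under
/sys/devices/system/edac/mc (mcN/csrowN/ce_count, ue_count and
chN_dimm_label); the function names are made up for illustration, and the
exact files present depend on the kernel version and the EDAC driver loaded.

```python
import glob
import os

# Assumed EDAC sysfs root; override it when testing or on unusual layouts.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_text(path, default=""):
    """Return the stripped contents of a sysfs file, or default if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default


def read_count(path):
    """Return an error counter as an int; missing or odd files count as 0."""
    text = read_text(path, "0")
    return int(text) if text.isdigit() else 0


def collect_edac_counts(root=EDAC_ROOT):
    """Walk mc*/csrow* and return one dict of counters per chip-select row."""
    rows = []
    for mc in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        for csrow in sorted(glob.glob(os.path.join(mc, "csrow[0-9]*"))):
            label_files = sorted(glob.glob(os.path.join(csrow, "ch*_dimm_label")))
            labels = [read_text(p) for p in label_files]
            rows.append({
                "mc": os.path.basename(mc),
                "csrow": os.path.basename(csrow),
                "ce": read_count(os.path.join(csrow, "ce_count")),  # corrected
                "ue": read_count(os.path.join(csrow, "ue_count")),  # uncorrected
                "labels": [label for label in labels if label],
            })
    return rows


def failing_rows(rows):
    """Keep only the rows that have logged at least one error."""
    return [r for r in rows if r["ce"] > 0 or r["ue"] > 0]


if __name__ == "__main__":
    for r in failing_rows(collect_edac_counts()):
        print("%s/%s ce=%d ue=%d labels=%s" % (
            r["mc"], r["csrow"], r["ce"], r["ue"], ",".join(r["labels"])))
```

Run from cron or a node health check, something like this prints nothing on
a clean node. The DIMM labels are only useful if they have been populated
for your motherboard; otherwise you have to map csrow/channel to a socket
by hand.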
At Advanced Clustering, we use this reporting facility in our Breakin
software--we run BLAS-optimized linpack from a RAM filesystem and watch for
Jason D. Clinton, 913-643-0306