[Beowulf] ECC Memory and Job Failures

Thu Apr 23 22:49:23 PDT 2009

2009/4/23 Nifty Tom Mitchell <niftyompi at niftyegg.com>:
> On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
>
>
> IMO running on a large cluster without ECC that provides multiple-bit
> detection and at least single-bit correction is silly.
>
> Further, running without watching the ECC logs is also silly.  Watching the
> logs can be hard to do.

Yes indeed.
At the risk of sounding like an SGI fanboy again: SGI Altix systems
keep excellent logs of hardware errors in /var/log/salinfo. Indeed, we
had a DIMM fail the day before yesterday; I sent off the traces, and
an engineer was on site yesterday to change it. If ESP email had been
able to squeak its way out of our network, I would probably have met
the engineer on my way into work before I had even called them.

More relevantly, there is excellent memory error detection and logging
on the ICE cluster. SGI provide a utility for switching on memory
error logging, which uses the 'worm' module and logs all errors to
syslog. As the blades all do central syslogging to their rack leaders,
you can track the errors readily. You don't even have to run your own
script to parse through the logs - the 'memcheck' utility will check
through your entire system and report the logged memory errors.
This facility has recently been very, very useful to me, and I've been
very grateful for SGI support.
Having experienced many other clusters, I think I can say that the SGI
attention to error logging like this is second to none.
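
For anyone without a tool like 'memcheck', a rough idea of what a
home-grown script over a central syslog might look like is below. This
is only a sketch and has nothing to do with how SGI's utility works:
it assumes traditional "Mon DD HH:MM:SS host message" syslog lines, and
the MEMORY_ERROR pattern and the node names in the usage note are
made-up placeholders that you would have to adapt to whatever your
kernel or platform actually logs.

```python
import re
from collections import Counter

# Traditional BSD-style syslog layout: "Mon DD HH:MM:SS host message..."
SYSLOG_LINE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s+(?P<msg>.*)$"
)

# ASSUMPTION: a guess at what memory-error messages contain.
# Replace this with a pattern matching what your systems really log.
MEMORY_ERROR = re.compile(r"\b(ECC|EDAC|memory error)\b", re.IGNORECASE)


def count_memory_errors(lines):
    """Return a Counter mapping host name -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        m = SYSLOG_LINE.match(line)
        # Skip lines that are not in syslog format or are not memory errors.
        if m and MEMORY_ERROR.search(m.group("msg")):
            counts[m.group("host")] += 1
    return counts
```

Feeding it the lines of the central log file (wherever your rack
leader or log host keeps it) and printing counts.most_common() gives a
quick list of the nodes with the most reported errors, which is usually
where you want to start looking for a failing DIMM.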

Couple that with command-line utilities to flash the BMC, CMC and
BIOSes, and you've got a winner.