[Beowulf] Curious about ECC vs non-ECC in practice
Tony Travis
a.travis at abdn.ac.uk
Fri May 20 08:35:45 PDT 2011
On 20/05/11 05:35, Joe Landman wrote:
> Hi folks
>
> Does anyone run a large-ish cluster without ECC ram? Or with ECC
> turned off at the motherboard level? I am curious if there are numbers
> of these, and what issues people encounter. I have some of my own data
> from smaller collections of systems, I am wondering about this for
> larger systems.
Hi, Joe.
I ran a small cluster of ~100 32-bit nodes with non-ECC memory and it
was a nightmare, as Guy described in his email, until I pre-emptively
tested the memory in user-space using Charles Cazabon's "memtester":
http://pyropus.ca/software/memtester
Prior to this, *all* the RAM had passed Memtest86+.
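For anyone who hasn't used it, the general idea of a user-space test is
easy to sketch in C: fill a region with a pattern, read every word back,
then repeat with the inverted pattern. This is only an illustration of
the technique, not memtester itself, which runs many more patterns and
mlock()s the region so it can't be paged out:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Write 'pattern' over the region, then verify every word. */
static int test_region(volatile uint64_t *buf, size_t words,
                       uint64_t pattern)
{
    size_t i;
    for (i = 0; i < words; i++)
        buf[i] = pattern;
    for (i = 0; i < words; i++) {
        if (buf[i] != pattern) {
            fprintf(stderr, "mismatch at word %zu: got 0x%016llx\n",
                    i, (unsigned long long)buf[i]);
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    size_t bytes = 64UL * 1024 * 1024;       /* test 64 MB */
    size_t words = bytes / sizeof(uint64_t);
    uint64_t *buf = malloc(bytes);
    int pass, failed = 0;

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    for (pass = 0; pass < 100; pass++) {     /* 100 passes, as below */
        failed |= test_region(buf, words, 0xAAAAAAAAAAAAAAAAULL);
        failed |= test_region(buf, words, 0x5555555555555555ULL);
    }
    free(buf);
    return failed;
}

(If memory serves, the real thing is invoked as e.g. "memtester 1024 100"
to make 100 passes over 1024 MB.)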
I had a strict policy that if a system crashed, for any reason, it was
re-tested with Memtest86+, then 100 passes of "memtester" before being
allowed to re-join the Beowulf cluster. This made the Beowulf much more
stable running openMosix. However, I've scrapped all our non-ECC nodes
now because the real worry is not knowing if an error has occurred...
Apparently this is still a big issue for computers in space, which use
non-ECC RAM as solid-state storage for imaging on grounds of cost. To
cope, they apparently run background SoftECC 'scrubbers' over the RAM,
like this:
http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
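I haven't implemented one, but the gist is easy to sketch: keep a
checksum per page and re-verify it in the background, so a silent bit
flip is at least *detected*. Everything in this toy version (page size,
page count, scrub interval, FNV-1a checksum, detection only) is my own
choice for illustration, not anything taken from the paper:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE  4096
#define PAGES 16

static uint8_t  data[PAGES][PAGE];  /* the memory being protected */
static uint32_t sums[PAGES];        /* one checksum per page      */

/* FNV-1a hash: cheap, detection only, no correction. */
static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    int i;

    for (i = 0; i < PAGES; i++)     /* record the "known good" sums */
        sums[i] = checksum(data[i], PAGE);

    for (;;) {                      /* background scrub loop */
        for (i = 0; i < PAGES; i++) {
            if (checksum(data[i], PAGE) != sums[i])
                fprintf(stderr, "bit flip detected in page %d\n", i);
        }
        sleep(60);                  /* scrub once a minute */
    }
}

The hard part, which the paper addresses, is keeping the checksums in
step with legitimate writes; a toy like this only protects data that is
effectively read-only between scrubs.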
Bye,
Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk