[Beowulf] ECC Memory and Job Failures

Thu Apr 23 09:37:58 PDT 2009

Huw,
I've seen similar cases. A not-to-be-named company that I worked at
decided to cut corners (and save cash) by purchasing a non-ECC cluster
to expand their processing systems. Needless to say, jobs failed or
returned incorrect results. All you need to do is multiply
utilization (> 90%) by the number of CPUs (or cores, these days) and
then divide by MTBF to find out how frequent these failures become.
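
As a rough sketch of that arithmetic (not a measurement from the
cluster described here), the estimate can be written as a one-line
function. The function name and all the numbers below are hypothetical
placeholders; the MTBF is taken to be a per-CPU figure in hours.

```python
def cluster_failures_per_hour(utilization, num_cpus, mtbf_hours):
    """Estimate failures per hour across the cluster.

    utilization: fraction of time the CPUs are busy (0.0 to 1.0)
    num_cpus:    number of CPUs (or cores) in the cluster
    mtbf_hours:  assumed mean time between failures per CPU, in hours
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return utilization * num_cpus / mtbf_hours


# Hypothetical inputs, chosen only to illustrate the calculation.
rate = cluster_failures_per_hour(utilization=0.9, num_cpus=1024,
                                 mtbf_hours=100_000.0)
print(f"{rate:.6f} failures/hour, about one every {1 / rate:.1f} hours")
```

The per-CPU rate looks negligible; it only becomes visible once it is
multiplied across every busy CPU in the cluster.
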
In this case, it worked out to a small but very telling number of
failures per hour across the cluster. And since the scheduler ran jobs
in lots of 8
to 512 CPUs (some of which ran for days) and ran jobs across like
nodes (we had many generations of similar systems, but most differed
in clock speed, not to mention locality of data), all but the small
jobs consistently failed on this cluster. Simple logic, but -
naturally - something the authorizing manager didn't consider and
managed to weasel out of.
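
To see why job size and run time matter, here is a minimal sketch. It
assumes failures are independent and exponentially distributed, which
is a modelling assumption not stated above, and it uses a hypothetical
per-CPU MTBF. A job is counted as failed if any one of its CPUs sees a
failure during the run.

```python
import math


def job_failure_probability(num_cpus, runtime_hours, mtbf_hours):
    """Probability that at least one failure hits a job during its run.

    Assumes independent, exponentially distributed failures with a
    per-CPU MTBF of mtbf_hours (an illustrative assumption).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    expected_failures = num_cpus * runtime_hours / mtbf_hours
    return 1.0 - math.exp(-expected_failures)


# Hypothetical MTBF; job sizes span the 8 to 512 CPU range, 48-hour runs.
for cpus in (8, 64, 512):
    p = job_failure_probability(cpus, runtime_hours=48.0,
                                mtbf_hours=100_000.0)
    print(f"{cpus:4d} CPUs for 48 h: P(failure) = {p:.4f}")
```

Whatever the real MTBF is, the probability grows with the product of
CPU count and run time, so wide multi-day jobs are hit far more often
than small ones.
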
ECC + large systems = good.
Derek R.

On 4/23/09, Huw Lynes <lynesh at cardiff.ac.uk> wrote:
> Thought this might be of interest to others:
>
> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html
>
> Apparently someone ran a large cluster job with both ECC and non-ECC
> RAM. They consistently got the wrong answer when forgoing ECC.
>
> I'd love to see the original data.
>
> Thanks,
> Huw
>
> --
> Huw Lynes                       | Advanced Research Computing
> HEC Sysadmin                    | Cardiff University
>                                 | Redwood Building,
> Tel: +44 (0) 29208 70626        | King Edward VII Avenue, CF10 3NB
