[Beowulf] cheap PCs this christmas

Mark Hahn hahn at physics.mcmaster.ca
Tue Nov 22 20:58:09 PST 2005


> > I'm interested to know about other people's views and experiences of
> > the reliability of COTS (i.e. non-ECC) memory?

reliability is always a gamble; reducing risk always means increasing
cost and/or decreasing performance.  the amount you decrease risk through
techniques like ECC can be large or small, depending on your configuration.

> My view has always been to use ECC memory.

the comfort factor of ECC always has to be balanced against the missed
opportunity cost of paying more.

> Aside from non-ECC memory being cheaper, I see no benefits of using it
> when one accounts for downtime, troubleshooting, paying for replacement
> RAM, and worse getting wrong results.

this implies that you see enough ECC detections to produce a significant
sample.  that in turn implies that you probably have both a high-altitude
facility and very large amounts of ram in use.

> Honestly, I never knew that not using ECC RAM on anything besides a
> nonessential system like a standard desktop configuration was ever an
> option.

I find that the use of "nonessential" often indicates rather poor reasoning
about the risks (and costs) involved.  a statistically-grounded approach
would weigh memory size, and perhaps activity, more heavily than whether
something is "desktop" or "server".

that said, our servers all have ECC.  on our current ~500 cpus and ~800GB,
I'd guess we see O(10) corruptions/year.  going to 7500 cores and >14TB,
(all with ECC) I'm pretty happy not to be risking undetected corruptions.

still, for some workloads, especially for leaner facilities (lower memory, 
less budget spent on network and storage), I'd certainly want to consider 
non-ECC.  I only wish vendors would publish their FIT figures, so we could
crunch the numbers properly.
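
as a rough illustration of that number-crunching, here is a minimal sketch
in Python.  it back-derives a FIT-style rate (events per 1e9 hours, here
expressed per GB) from the guess above of ~10 corruptions/year on ~800GB,
then scales it to other memory sizes.  the linear scaling with memory size,
the function names, reading ">14TB" as 14000 GB, and the 1GB figure for a
cheap PC are all assumptions made for illustration, not vendor data.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year, in hours


def implied_fit_per_gb(events_per_year, mem_gb):
    """Back out a FIT rate (events per 1e9 hours) per GB from an observed rate."""
    return events_per_year * 1e9 / (HOURS_PER_YEAR * mem_gb)


def expected_events_per_year(fit_per_gb, mem_gb):
    """Expected events/year, assuming the rate scales linearly with memory size."""
    return fit_per_gb * mem_gb * HOURS_PER_YEAR / 1e9


# observed guess from this post: ~10 corruptions/year on ~800GB
fit = implied_fit_per_gb(10, 800)
print(f"implied rate: {fit:.0f} FIT per GB")

# hypothetical configurations; the 1GB cheap PC is an assumed size
for label, gb in [("one cheap PC, 1GB", 1),
                  ("current cluster, 800GB", 800),
                  ("planned cluster, 14000GB", 14000)]:
    rate = expected_events_per_year(fit, gb)
    print(f"{label}: {rate:.4g} events/year")
```

under those assumptions the planned machine would see on the order of 175
events/year, while a single 1GB box would see about one every 80 years.
real FIT figures depend on the parts, altitude, and activity, so treat the
output as order-of-magnitude only.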

more to the point, if you're going to network $300 PCs, ECC should almost
certainly not be on your xmas list...



