[Beowulf] Re: cheap PCs this christmas

Jim Lux James.P.Lux at jpl.nasa.gov
Tue Nov 15 04:30:40 PST 2005

At 10:42 AM 12/1/2005, Tony Travis wrote:

>The problem with modern non-ECC COTS memory is that it doesn't even have 
>parity checking because that slows memory access down and reduces the 
>total memory capacity of a chip. I think there is some prospect of using 
>software to detect COTS memory errors. This is of interest to projects 
>involved in using large amounts of COTS memory instead of hard disks to 
>store data in space. I just wonder if anyone can come up with similar ways 
>of detecting COTS memory errors down here on the ground...

Some algorithms have a built-in way to check their own results.  Consider 
FFTs: alongside the FFT, one can compute a single bin directly from the 
DFT definition (for instance, summing all the input points gives the "DC" 
term, which is N times their average under the usual unnormalized 
convention) and then compare that against the corresponding FFT output.

There's a fair amount of literature on this, usually under the name 
algorithm-based fault tolerance (ABFT).
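
A minimal sketch of the DC-bin check, in Python with only the standard 
library.  The function names (fft, dc_check), the tolerance, and the test 
signal are made up for illustration, and the textbook radix-2 FFT here 
stands in for whatever FFT library you actually use.  It assumes the 
unnormalized convention, where bin 0 equals the sum of the inputs.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dc_check(x, spectrum, rel_tol=1e-9):
    """Compare FFT bin 0 with an independently computed sum of the inputs."""
    direct = sum(x)  # unnormalized DC term, i.e. N times the mean
    # Scale the tolerance by the input magnitude (hypothetical choice).
    scale = max(1.0, sum(abs(v) for v in x))
    return abs(spectrum[0] - direct) <= rel_tol * scale


# Arbitrary 64-point test signal (illustrative only).
samples = [float((7 * i) % 13) for i in range(64)]
spectrum = fft(samples)
ok_clean = dc_check(samples, spectrum)

# Simulate a memory error hitting one stored output value.
corrupted = list(spectrum)
corrupted[0] += 1.0
ok_corrupt = dc_check(samples, corrupted)

print(ok_clean, ok_corrupt)
```

Checking one bin this way only catches errors that reach that bin, and it 
won't notice an input sample that was already corrupted before both 
computations read it, so treat it as a cheap sanity check rather than 
full coverage.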

>I think you are probably right that the simplest solution is to run jobs 
>twice to confirm the results, and this is one of the strategies proposed 
>for 'massively' parallel computing in 'space' too. However, if memory 
>errors corrupt the openMosix kernel then you get into BIG trouble!

