[Beowulf] Not quite Walmart, or, living without ECC?

Tony Travis ajt at rri.sari.ac.uk
Mon Nov 26 16:02:52 PST 2007


David Mathog wrote:
 > I ran a little test over the Thanksgiving holiday to see how common
 > random errors in nonECC memory are.  I used the memtest86+ bit fade test
 > mode, which writes all 1s, waits 90 minutes, checks the result, then
 > does the same thing for all 0s.   Anyway, this was the best test I could
 > find for detecting the occasional gamma ray type data loss event.  The
 > [...]

Hello, David.

Memtest86+ is fine for 'burn-in' tests, but it does not stress memory
under the conditions in which normal applications run. I test new
non-ECC compute nodes by booting memtest86+ and running it for 24h. If
there are no errors, I reboot into Linux and run memtester. I've found
memory that passes a 24h memtest86+ test but fails memtester:

     http://pyropus.ca/software/memtester/

If one of our compute nodes crashes while in use, it is re-tested the
same way before being allowed to rejoin the openMosix cluster. It is
possible that faults detected by memtester are caused by other
components, such as CPUs overheating or PSUs struggling to provide
enough power, but the important point is that these problems affect
applications in a similar way.

All the compute nodes in our Beowulf cluster have to pass 24h Memtest86+ 
clean, followed by 100 memtester runs on 128MB RAM before being trusted 
to accept openMosix migrated processes, or to be used as LAM MPI hosts.
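
The memtester stage of this acceptance test could be scripted roughly as
follows. This is only a sketch under assumptions, not the procedure's
actual script: it assumes "100 memtester runs on 128MB" maps to a single
invocation of the form `memtester 128M 100` (size, then loop count), that
memtester is on the PATH, and that it is run with enough privilege to lock
memory. The function names are hypothetical. The exit-status bits are the
ones described in the memtester man page.

```python
import subprocess
import sys

# Exit-status bits as described in the memtester man page.
EXIT_BITS = {
    0x01: "could not allocate/lock memory, or bad invocation",
    0x02: "stuck address test failed",
    0x04: "one of the other tests failed",
}


def build_command(megabytes=128, loops=100):
    """Return the argv for: memtester <size>M <loops> (assumed mapping)."""
    return ["memtester", f"{megabytes}M", str(loops)]


def decode_exit(status):
    """Translate an exit status into a list of reasons (empty list = pass)."""
    if status < 0:
        # subprocess reports death by signal N as returncode -N.
        return [f"terminated by signal {-status}"]
    return [msg for bit, msg in EXIT_BITS.items() if status & bit]


def node_passes(megabytes=128, loops=100):
    """Run memtester once and report whether the node passed."""
    result = subprocess.run(build_command(megabytes, loops))
    for reason in decode_exit(result.returncode):
        print("memtester:", reason, file=sys.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    # Exit 0 only if the node is fit to rejoin the cluster.
    sys.exit(0 if node_passes() else 1)
```

A wrapper like this only reports pass or fail; whether a failing node is
kept out of the cluster is left to whatever calls it.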

Best wishes,

     Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687


