[Beowulf] Stress / torture test cluster hardware
Andrew Shewmaker agshew at gmail.com
Sat Oct 7 21:26:52 PDT 2006
On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:
> "memtest86" http://www.memtest86.com/

If you are using a large amount of ECC memory, you may find it necessary to keep track of Single Bit Errors (SBEs) and look for "weak" DIMMs using something like the EDAC/bluesmoke drivers (http://bluesmoke.sourceforge.net) and a userspace memory tester.

On a 264 node cluster with 8-16GB RAM, I had to weed out weak memory over a period of months. A given node running a memory tester would show no SBEs for a day or more, then suddenly show a huge burst. Another system might have a more consistently incrementing SBE counter. Now, ECC was working, so applications like the memory tester weren't having problems. However, I couldn't reliably reboot this cluster because the BIOS would often refuse to boot unless a node was powered off for, say, five minutes.

I wrote up more of this experience on the Real World Tech forum:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=69894&threadid=69639&roomid=11

It looks like the latest version of Stresslinux has a 2.6.16.18 kernel, so it should have the EDAC drivers included. Plus, it has the userspace memtester.

Memtest86 is nice, but it didn't support checking the ECC counters on the cluster I mention above, so it couldn't help me weed out DIMMs at all.

See http://agenda.clustermonkey.net/index.php/Memory for some more info about this (links to LWN articles and a list of supported drivers in 2.6.16).

I wasn't aware of the EDAC wiki until I saw it linked from the bluesmoke page just now. It will tell you what chipset support is coming.
http://buttersideup.com/edacwiki/

I would be interested to hear what kind of single bit error rates other people see on their clusters.

--
Andrew Shewmaker
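
For anyone who wants to watch the SBE counters while a userspace memory tester runs, here is a minimal sketch of a poller. It assumes the EDAC driver for your chipset is loaded and exposes per-csrow correctable-error counts as mc*/csrow*/ce_count under /sys/devices/system/edac/mc; that layout varies between kernel versions, so check your own tree first. The function names, the polling interval, and the burst threshold are all made up for illustration.

```python
import time
from pathlib import Path

# Assumed location of the EDAC sysfs tree; verify it on your kernel.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_MC_ROOT):
    """Return {"mc0/csrow0": count, ...} for every readable ce_count file."""
    counts = {}
    for counter in sorted(Path(root).glob("mc*/csrow*/ce_count")):
        label = f"{counter.parent.parent.name}/{counter.parent.name}"
        try:
            counts[label] = int(counter.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that vanish or contain something unexpected.
            continue
    return counts


def new_errors(previous, current):
    """Return the counters that grew since the previous sample, with the increase."""
    return {
        label: count - previous.get(label, 0)
        for label, count in current.items()
        if count > previous.get(label, 0)
    }


def watch(samples, interval_s=60, burst_threshold=100, root=EDAC_MC_ROOT):
    """Poll the counters `samples` times and print every increase.

    burst_threshold is an arbitrary illustrative value, not a recommendation.
    """
    previous = read_ce_counts(root)
    for _ in range(samples):
        time.sleep(interval_s)
        current = read_ce_counts(root)
        for label, delta in sorted(new_errors(previous, current).items()):
            kind = "BURST" if delta >= burst_threshold else "sbe"
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {kind} {label} +{delta}")
        previous = current
```

Calling watch(samples=1440) on a node would log increases once a minute for a day. Because it reports deltas rather than totals, both patterns described above show up in the log: the node that is quiet for a day and then bursts, and the node whose counter creeps up steadily. Mapping a csrow back to a physical DIMM depends on the board, so that step is left out.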
