[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
Jason Clinton jclinton at advancedclustering.com
Wed Aug 6 12:56:51 PDT 2008
On Sat, Aug 2, 2008 at 6:57 AM, Paulo Afonso Lopes <pal at di.fct.unl.pt> wrote:
> Thanks, Mark
>
>>> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
>>> DL145-G2 nodes with 2 dual-core 275 / 4GB each.
>>
>> it's worth making sure you have current bios installed.
>>
> Not the latest, but the previous; according to "Fixes" just a single,
> unrelated fix. Anyway I'm upgrading it...
>>
>>> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
>>
>> it may also be useful to run mcelog, which will tell you about
>> any ongoing _correctable_ ECC activity.
>
> No output in any of the 4 hosts; tried with/without --k8, --dmi, etc.

We have a tool on our website called "breakin" that is Linux 2.6.25.9
patched with K8 and K10f Opteron EDAC reporting facilities. It can usually
find and identify failed RAM in fifteen minutes (two hours at most). The
EDAC patches to the kernel aren't that great about naming the correct
memory rank, though.

Make sure you have multibit (sometimes says 4-bit) ECC enabled in your BIOS.

http://www.advancedclustering.com/software/breakin.html
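
For anyone collecting event-log lines like the one quoted above from several nodes, here is a rough Python sketch that splits such a line into fields and flags uncorrectable ECC events. The field labels (timestamp, sensor, event, state) and the function names are my own guesses for illustration, not anything defined by the tool that produced the log, and the sketch assumes every line has the same five pipe-separated columns with a month/day/year date.

```python
from datetime import datetime


def parse_event_line(line):
    """Split one pipe-delimited event-log line into labeled fields.

    Assumes five columns: date, time, sensor, event, state.
    The labels are hypothetical; they are not taken from any tool's docs.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    date, time, sensor, event, state = parts
    return {
        # Assumes MM/DD/YYYY, as in the line quoted in this thread.
        "timestamp": datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S"),
        "sensor": sensor,
        "event": event,
        "state": state,
    }


def is_uncorrectable(record):
    """True if the event text mentions an uncorrectable ECC error."""
    return "uncorrectable ecc" in record["event"].lower()


if __name__ == "__main__":
    sample = "07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted"
    rec = parse_event_line(sample)
    print(rec["timestamp"].isoformat(), rec["sensor"], is_uncorrectable(rec))
```

Feeding it the logs from all four hosts and grouping by sensor would at least show whether the errors always land on the same memory sensor, which is useful to know before swapping DIMMs.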
