[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
Paulo Afonso Lopes pal at di.fct.unl.pt
Fri Aug 1 08:40:42 PDT 2008
- Previous message: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
- Next message: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Dear all:

Around 2/Apr I removed the 2 Opteron 246 CPUs and their "companion" 4x 512 MB DIMMs from two HP DL145-G2 nodes, leaving them empty, to populate two other HPs (which now have 2 CPUs and 4 GB per node). I then installed 2 dual-core Opterons in each emptied DL145-G2, together with 4 sticks of 1 GB (2 sticks per CPU).

So I have 2 DL145-G2 nodes with 2 single-core 246 / 4 GB each, and 2 DL145-G2 nodes with 2 dual-core 275 / 4 GB each.

On 18/Apr, one of the dual-core nodes crashed with an ECC error. From IPMI, for that node:

  04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
  06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
  06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
  07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
  07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
  07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
  07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

On 07/19 the memory of CPU0 was replaced; on the 27th, the remaining memory was replaced. The ECC crashes continue, at a rate between 1 per day and 1 per week.

07/28: first ECC error on the other Opteron 275 node.

  07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted

All nodes have IB boards, and I swapped the boards between the first crashing node and the second crashing node (it was a few days after that swap that the second node crashed for the very first time).

I have observed that, within 2 minutes of each ECC error, these events are always logged:

  06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
  06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted

(but they are also logged at other times)

I am running Scientific Linux 5; the (LAM) MPI application uses almost 100% CPU and exchanges lots of small packets through IPoIB (I have not used "native" IB yet). "Everything" is 64-bit (kernel, apps).

Any thoughts?
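The "within 2 minutes" observation above could be checked mechanically across the whole event log rather than by eye. What follows is only a minimal sketch in Python, assuming the log is available as pipe-separated text in exactly the form quoted above; the function names `parse_sel` and `power_events_near_ecc` are made up for this illustration, and the sample input is limited to three lines taken from this message.

```python
from datetime import datetime, timedelta

# Sample input: three event lines copied from the message above.
SEL_TEXT = """\
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
"""


def parse_sel(text):
    """Parse 'date | time | sensor | description | state' lines (hypothetical helper)."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip anything that is not in the assumed 5-field form
        date, time, sensor, description, state = fields
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        events.append((when, sensor, description, state))
    return events


def power_events_near_ecc(events, window=timedelta(minutes=2)):
    """Map each ECC event time to the ACPI power-state events within `window` of it."""
    ecc = [e for e in events if "ECC" in e[2]]
    power = [e for e in events if e[1].startswith("System ACPI Power State")]
    return {e[0]: [p for p in power if abs(p[0] - e[0]) <= window] for e in ecc}


matches = power_events_near_ecc(parse_sel(SEL_TEXT))
for when, nearby in matches.items():
    print(when, "->", len(nearby), "power-state event(s) within 2 minutes")
```

Running the same check with the full log, and then listing power-state events that have no ECC error nearby, would show how often these events occur on their own.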
Best Regards,
paulo lopes

--
Paulo Afonso Lopes                  | Tel: +351- 21 294 8536
Departamento de Informática         |      294 8300 ext.10763
Faculdade de Ciências e Tecnologia  | Fax: +351- 21 294 8541
Universidade Nova de Lisboa         | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL
