[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
Paulo Afonso Lopes
pal at di.fct.unl.pt
Fri Aug 1 08:40:42 PDT 2008
Dear all:
Around 2 April I removed 2 Opteron 246s and their "companion" 4x 512 MB
DIMMs from two HP DL145-G2s, leaving them empty, to populate two other HPs
(which now have 2 CPUs and 4GB per node).
Then I installed 2 dual-core Opteron 275s in each emptied DL145-G2,
together with 4 sticks of 1GB (2 sticks per CPU).
So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
DL145-G2 nodes with 2 dual-core 275 / 4GB each.
On 18 April, one of the dual-core nodes crashed with an ECC error. From
the IPMI event log for that node:
04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
On 07/19 the memory of CPU0 was replaced; on 07/27 the remaining memory
was replaced. The ECC crashes continue, anywhere from one per day to one
per week.
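
To put numbers on the crash frequency, here is a minimal Python sketch that
computes the time between consecutive ECC events. It assumes the
"date | time | sensor | event | state" layout of the log lines quoted above;
the names parse_sel_line and ecc_gaps are hypothetical helpers, not part of
any IPMI tool.

```python
from datetime import datetime

# ECC records copied verbatim from the log quoted in this message.
SEL_LINES = """\
04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
""".splitlines()


def parse_sel_line(line):
    """Split one 'date | time | sensor | event | state' record into fields."""
    date, time, sensor, event, state = [f.strip() for f in line.split("|")]
    when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return when, sensor, event, state


def ecc_gaps(lines):
    """Return the time differences between consecutive ECC events."""
    times = sorted(parse_sel_line(l)[0] for l in lines
                   if "Uncorrectable ECC" in l)
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in ecc_gaps(SEL_LINES):
    print(gap)
```

Running the same function over the full event log of each node, rather than
this excerpt, would show whether the interval changed after either memory
replacement.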
07/28: first ECC error on the other Opteron-275 populated node.
07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted
All nodes have IB boards, and I swapped the boards between the first
crashing and second crashing nodes (a few days after that swap, the second
node crashed for the very first time).
I have observed that these events are always logged within 2 minutes of
each ECC error:
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted
(but they are logged also at other times)
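
The "within 2 minutes" observation can be checked mechanically. Below is a
minimal Python sketch, assuming each record sits on a single line in the
"date | time | sensor | event | state" layout quoted in this message. The
function name events_near_ecc and the 2-minute default window are
illustrative choices, not anything provided by IPMI tooling.

```python
from datetime import datetime, timedelta

# Sample records copied from this message, one record per line.
SAMPLE = [
    "06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted",
    "06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted",
    "06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted",
]


def parse(line):
    """Split one record into (timestamp, sensor, event, state)."""
    date, time, sensor, event, state = [f.strip() for f in line.split("|")]
    when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return when, sensor, event, state


def events_near_ecc(lines, window=timedelta(minutes=2)):
    """Map each ECC timestamp to the non-ECC records within `window` of it."""
    records = [parse(l) for l in lines]
    ecc = [r for r in records if r[2] == "Uncorrectable ECC"]
    others = [r for r in records if r[2] != "Uncorrectable ECC"]
    return {e[0]: [o for o in others if abs(o[0] - e[0]) <= window]
            for e in ecc}


for when, nearby in events_near_ecc(SAMPLE).items():
    print(when, "->", [f"{s}: {ev} ({st})" for _, s, ev, st in nearby])
```

Since the power-state events also appear at other times, the reverse count
(power-state events with no ECC error nearby) would be needed before
reading much into the correlation.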
I am running Scientific Linux 5. The (LAM) MPI application uses almost
100% CPU and exchanges lots of small packets over IPoIB (I have not used
"native" IB yet). "Everything" is 64-bit (kernel, apps).
Any thoughts?
Best Regards,
paulo lopes
--
Paulo Afonso Lopes | Tel: +351- 21 294 8536
Departamento de Informática | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia | Fax: +351- 21 294 8541
Universidade Nova de Lisboa | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL