[Beowulf] Fwd: H8DMR-82 ECC error
reuti at staff.uni-marburg.de
Mon Aug 1 10:38:30 PDT 2011
On behalf of Jörg, I am forwarding this to the list, as his account seems to be blocked from posting to it.
> Dear all,
> as I cannot post directly to the list although I am subscribed to it, I have
> asked a friend of mine to post this for me.
> I am currently having severe problems with one of the clusters I am
> maintaining. Around 50% of these nodes are crashing when we are running cp2k
> on it. Although they are IB nodes, even without the IB card installed the test
> jobs crash the node as well, so I can rule out an IB-related problem. Memtest
> was OK: I ran 9 cycles without any problems. Unfortunately I cannot swap the
> memory, as I don't have any spare modules, and hence I have to rely on Memtest
> here. The nodes which are causing the problems show other symptoms as well:
> 3 of them failed to boot again after a normal shutdown procedure
> (the fans come on and die after a short period, and I don't even get to the
> POST stage). So they are offline as well. Two of the remaining nodes were
> exceedingly hot after a reboot. When I took them out the fans were spinning
> and now they appear to be ok. These are AMD Opteron 2220 dual core processors
> with 2 CPUs per node. The motherboard is an H8DMR-82 with the BIOS version
> 080014 (release date 07/13/2007). It appears that almost always the same nodes
> are crashing with this error message:
> Hardware Error
> CPU0 Machine Check Exception 4 Bank 2 b200200000000863
> TSC 108dd369444
> Processor 2:40f13 Time 1311847912 Socket 0 APIC 0
> MC2-Status: Uncorrected error, report: yes MisV: invalid
> CPU context corrupt: yes UECC Error
> Bus Unit Error: prefetch/ECC error in data read from NB: local node originated
> Transaction type: prefetch (mem access), no timeout, cache level L3/generic.
> Participating Processors: local node originated (SRC)
> Judging from this I would guess there is a memory related problem.
> Given there are a number of people on the list here and they probably have
> seen similar hardware before, do I simply have a bad batch of hardware which
> is known to cause problems or do I have a different issue here? What I am after
> is some kind of idea of where to look next. It is not the compiled program as
> taking out the disc and placing it in a different node (same motherboard, same
> Opteron but slightly different flags) does not cause any problems at all.
> Given the large number of nodes which are causing problems, I would like to
> make sure, before I propose writing off these nodes, that the problem cannot
> be cured by something subtle such as a BIOS upgrade.
> Many thanks for your help and all the best from London
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> WC1H 0AJ
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
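
For anyone who wants to cross-check the decoded text against the raw status word quoted above (b200200000000863), here is a minimal Python sketch. It assumes the usual MCi_STATUS layout for AMD K8-family Opterons as I understand it (VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits 63-57, CECC/UECC in bits 46/45, and a bus error code of the form 0000 1PPT RRRR IILL in the low 16 bits). The function name `decode_mci_status` is made up for illustration and is not part of mcelog or the kernel.

```python
# Sketch only: bit positions are assumed from the AMD K8 machine check
# status register layout; check the BKDG for the exact CPU revision.

PP = {0: "SRC (local node originated)", 1: "RES (local node responded)",
      2: "OBS (observed as third party)", 3: "GEN (generic)"}
RRRR = {0: "GEN", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
II = {0: "MEM", 1: "reserved", 2: "IO", 3: "GEN"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG (generic)"}


def bit(value: int, n: int) -> bool:
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mci_status(status: int) -> dict:
    """Decode the flag bits and, if present, the bus error code."""
    out = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "enabled": bit(status, 60),
        "miscv": bit(status, 59),
        "addrv": bit(status, 58),
        "pcc": bit(status, 57),      # processor context corrupt
        "cecc": bit(status, 46),     # correctable ECC error
        "uecc": bit(status, 45),     # uncorrectable ECC error
    }
    code = status & 0xFFFF
    # Bus error codes look like 0000 1PPT RRRR IILL.
    if code & 0xF800 == 0x0800:
        out["kind"] = "bus"
        out["participation"] = PP[(code >> 9) & 0x3]
        out["timeout"] = bit(code, 8)
        out["request"] = RRRR.get((code >> 4) & 0xF, "unknown")
        out["io_mem"] = II[(code >> 2) & 0x3]
        out["cache_level"] = LL[code & 0x3]
    else:
        out["kind"] = "other"
    return out


if __name__ == "__main__":
    for key, value in decode_mci_status(0xB200200000000863).items():
        print(f"{key:14} {value}")
```

Under those assumptions the flags come out the same way as the decoded text in the report: uncorrected, context corrupt, UECC set, and a locally originated memory prefetch with no timeout.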