[Beowulf] Fwd: H8DMR-82 ECC error

Reuti reuti at staff.uni-marburg.de
Mon Aug 1 10:38:30 PDT 2011

Hi all,

On behalf of Jörg, I am forwarding this to the list, as his account seems to be blocked from posting to it.

-- Reuti

> #############
> Dear all,
> as I cannot post directly to the list although I am subscribed to it, I have 
> asked a friend of mine to post this for me.
> I am currently having severe problems with one of the clusters I am 
> maintaining. Around 50% of its nodes crash when we run cp2k on them. 
> Although they are IB nodes, the test jobs crash the node even without the IB 
> card installed, so I can rule out an IB-related problem. Memtest 
> was OK: I ran 9 cycles without any problems. Unfortunately I cannot swap the 
> memory as I don't have any spare modules, hence I have to rely on Memtest 
> here. The nodes which are causing the problems show other symptoms as well: 
> 3 of them would not boot again after a normal shutdown procedure (the fans 
> come on and die after a short period, and I don't even get to the POST 
> stage). So they are offline as well. Two of the remaining nodes were 
> exceedingly hot after a reboot. When I took them out the fans were spinning 
> and now they appear to be ok. These are AMD Opteron 2220 dual core processors 
> with 2 CPUs per node. The motherboard is an H8DMR-82 with BIOS version 
> 080014 (release date 07/13/2007). It appears that almost always the same nodes 
> are crashing with this error message:
> Hardware Error
> CPU0 Machine Check Exception  4 Bank 2 b200200000000863
> TSC 108dd369444
> Processor 2:40f13 Time 1311847912 Socket 0 APIC 0
> MC2-Status: Uncorrected error, report: yes MiscV: invalid
> CPU context corrupt: yes UECC Error
> Bus Unit Error: prefetch/ECC error in data read from NB: local node originated 
> (SRC)
> Transaction type: prefetch (mem access), no timeout, cache level L3/generic. 
> Participating Processors: local node originated (SRC)
> Judging from this I would guess there is a memory related problem.
> Given there are a number of people on the list here and they probably have 
> seen similar hardware before, do I simply have a bad batch of hardware which 
> is known to cause problems or do I have a different issue here? What I am after 
> is some kind of idea of where to look next. It is not the compiled program as 
> taking out the disc and placing it in a different node (same motherboard, same 
> Opteron but slightly different flags) does not cause any problems at all.
> Given the large number of nodes which are causing problems, I would like to 
> make sure it is not a subtle issue that something like a BIOS upgrade could 
> cure before I propose writing off these nodes.
> Many thanks for your help and all the best from London
> Jörg
> ##############
> -- 
> *************************************************************
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ 
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
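
For reference, the status word in the quoted machine check (b200200000000863) can be broken down by hand. The following is only a minimal sketch: it assumes the usual x86 machine-check status layout, that bits 46/45 are the AMD-specific CECC/UECC flags on this Opteron, and that the low 16 bits use the generic bus-error encoding. The function and table names are made up for illustration and are not taken from mcelog.

```python
# Illustrative decoder for a machine-check status word (hypothetical names).
# Assumption: architectural x86 MCA bit positions; bits 46/45 (CECC/UECC)
# are AMD-specific and assumed to apply to the Opteron in question.

STATUS_FLAGS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "miscv": 59,
    "addrv": 58,
    "context_corrupt": 57,
    "cecc": 46,
    "uecc": 45,
}

# Sub-fields of a bus error code, bit pattern 0000 1PPT RRRR IILL.
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]
REQUEST = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
MEM_IO = ["MEM", "reserved", "IO", "GEN"]
LEVEL = ["L0", "L1", "L2", "LG"]


def decode_status(status):
    """Return the flag bits and, if applicable, the bus-error sub-fields."""
    out = {name: bool((status >> pos) & 1) for name, pos in STATUS_FLAGS.items()}
    code = status & 0xFFFF
    out["error_code"] = code
    if (code & 0xF800) == 0x0800:  # bit 11 set, bits 15:12 clear: bus error
        request = (code >> 4) & 0xF
        out["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 1),
            "request": REQUEST[request] if request < len(REQUEST) else "reserved",
            "mem_io": MEM_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return out


if __name__ == "__main__":
    # Status value copied from the machine check exception in the report.
    for key, value in decode_status(0xB200200000000863).items():
        print(f"{key}: {value}")
```

Under these assumptions the bits agree with the decoded text in the log: uncorrected, CPU context corrupt, UECC, and a prefetch from memory with the local node as source. The address-valid bit comes out clear, so this particular record probably does not carry an address that could be used to pick out a DIMM.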
