[Beowulf] ECC support on motherboards?

Håkon Bugge Hakon.Bugge at scali.com
Tue May 13 14:16:17 PDT 2008


At 19:17 13.05.2008, Perry E. Metzger wrote:
>So another question is, how can you reliably test any of this stuff?
>It isn't like you can reliably induce single bit errors and see if the
>hardware catches them. (A special memory module that let you test
>would be a wonderful thing, but I've never even heard of such a thing.)

Well, you can trust the HW rather than the firmware.
Further, for some chipsets it is possible to simply
stop the memory refresh for some time (~1 minute)
while the system is idle. When you enable it again,
you should see single and/or double bit errors. The
refresh can be disabled and enabled through setpci or
a similar tool. If you do not see errors after this,
you can try to explain why...
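One way to check whether errors were actually reported after such a refresh pause is to compare the kernel's error counters before and after. The following is only a minimal sketch, assuming a Linux system with an EDAC driver loaded for the memory controller, so that the counters appear under /sys/devices/system/edac/mc. The function names read_edac_counts and diff_counts are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Assumption: Linux EDAC sysfs layout, one directory per memory controller.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} error counts."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded, or not a Linux system.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters cannot be read or parsed.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def diff_counts(before, after):
    """Return the per-controller increase in (corrected, uncorrected) errors."""
    return {
        name: (after[name][0] - ce, after[name][1] - ue)
        for name, (ce, ue) in before.items()
        if name in after
    }


if __name__ == "__main__":
    # Print the current counters; take one snapshot before the refresh
    # pause and one after it, then compare them with diff_counts().
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

If the counters do not move after the pause, that does not by itself prove ECC is inactive: the EDAC driver may not support the chipset, or the firmware may be handling the errors without reporting them.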

Once I wrote a tool which examined all settings of
a particular chipset. That raised numerous questions for the vendor.
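The tool itself is not shown here. As a rough illustration of the starting point for such an inspection, the sketch below decodes the standard fields at the start of a PCI configuration space and prints the raw registers in hex. It assumes Linux, where the configuration space of a device can be read from /sys/bus/pci/devices/<address>/config. The helper names parse_pci_header and hex_dump are hypothetical, and the chipset-specific registers (such as those controlling refresh or ECC) are not decoded, since their layout is only given in the vendor's datasheet.

```python
import struct
import sys


def parse_pci_header(config: bytes) -> dict:
    """Decode the first four 16-bit little-endian fields of PCI config space."""
    if len(config) < 8:
        raise ValueError("need at least 8 bytes of configuration space")
    vendor, device, command, status = struct.unpack_from("<HHHH", config, 0)
    return {
        "vendor_id": vendor,   # offset 0x00
        "device_id": device,   # offset 0x02
        "command": command,    # offset 0x04
        "status": status,      # offset 0x06
    }


def hex_dump(config: bytes, width: int = 16) -> list:
    """Return lines of the form 'offset: byte byte ...' for manual review."""
    lines = []
    for off in range(0, len(config), width):
        chunk = config[off:off + width]
        lines.append(f"{off:02x}: " + " ".join(f"{b:02x}" for b in chunk))
    return lines


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python pcidump.py 0000:00:00.0
    path = f"/sys/bus/pci/devices/{sys.argv[1]}/config"
    with open(path, "rb") as fh:
        data = fh.read()
    for key, value in parse_pci_header(data).items():
        print(f"{key}: 0x{value:04x}")
    print("\n".join(hex_dump(data)))
```

Comparing two such dumps, taken with different BIOS settings, is one way to find which registers a firmware option actually changes.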


Hakon


>I'm doing the planning for a new cluster and the whole thing is
>remarkably bothersome. You can't easily figure out what motherboards
>will even pretend to do ECC that easily, you can't easily check once
>you have a sample motherboard in hand. It isn't even easy to get ECC
>memory for more modern standards. I'm starting to wonder if doing all
>calculations twice, once on each of two machines, isn't easier, but it
>seems utterly wrong to do that...
>
>Perry

--
Håkon Bugge
CTO
mob. +47 92 48 45 14
off. +47 92 44 81 11
fax. +47 22 23 36 66
Hakon.Bugge at scali.com
Skype: hakon_bugge

Scali - http://www.scali.com
Higher Performance Computing




