[Beowulf] ECC support on motherboards?
Håkon Bugge
Hakon.Bugge at scali.com
Tue May 13 14:16:17 PDT 2008
At 19:17 13.05.2008, Perry E. Metzger wrote:
>So another question is, how can you reliably test any of this stuff?
>It isn't like you can reliably induce single bit errors and see if the
>hardware catches them. (A special memory module that let you test
>would be a wonderful thing, but I've never even heard of such a thing.)
Well, you can trust the HW rather than the firmware.
Further, for some chipsets it is possible to
simply stop the memory refresh for some time
(~1 minute) while the system is idle. After
this, you enable it again, and you should see
single and/or double bit errors. The refresh can be
disabled and re-enabled through setpci or a similar tool. If
you do not see errors after this, you can try to explain why...
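To make the "should see errors" step concrete, here is a minimal sketch of the bookkeeping around such an experiment. It assumes a Linux system with an EDAC driver loaded for the chipset, so that corrected and uncorrected error counters appear as ce_count and ue_count under /sys/devices/system/edac/mc/mc*/. The function names are made up for illustration, and the refresh toggle itself is chipset-specific and deliberately left out.

```python
import glob
import os


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Sum corrected (ce) and uncorrected (ue) error counters over all
    memory controllers found under edac_root.

    Assumption: an EDAC driver for the chipset is loaded; without one the
    directory is empty and both totals stay at zero.
    """
    totals = {"ce": 0, "ue": 0}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        for key in totals:
            path = os.path.join(mc_dir, key + "_count")
            try:
                with open(path) as f:
                    totals[key] += int(f.read().strip())
            except (OSError, ValueError):
                # Missing or unreadable counter: skip it rather than guess.
                pass
    return totals


def new_errors(before, after):
    """Return how many errors were logged between two snapshots."""
    return {key: after[key] - before[key] for key in before}


if __name__ == "__main__":
    before = read_edac_counts()
    input("Stop refresh, wait, re-enable it, touch memory, then press Enter: ")
    after = read_edac_counts()
    print(new_errors(before, after))
```

If both differences are zero after the refresh was really off for a minute, either the errors are not being detected or they are not being reported, which is exactly the question to take to the vendor.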
I once wrote a tool that examined all the settings of
a particular chipset. That raised numerous questions for the vendor.
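This is not that tool, only a rough sketch of the idea, assuming Linux exposes the device's PCI configuration space as a binary file at /sys/bus/pci/devices/<address>/config. The vendor and device ID offsets are standard PCI; the ecc_enable entry (offset 0x50, bit 0) is purely hypothetical and would have to be replaced with fields from the chipset datasheet.

```python
def read_config(path):
    """Read raw PCI configuration space, e.g. from
    /sys/bus/pci/devices/0000:00:00.0/config (the address is an example)."""
    with open(path, "rb") as f:
        return f.read()


def extract_field(config, offset, width, lsb, nbits):
    """Return the nbits-wide field starting at bit lsb of the little-endian
    register that is width bytes wide and located at byte offset."""
    raw = int.from_bytes(config[offset:offset + width], "little")
    return (raw >> lsb) & ((1 << nbits) - 1)


# name -> (byte offset, register width in bytes, lowest bit, number of bits)
FIELDS = {
    "vendor_id": (0x00, 2, 0, 16),   # standard PCI header field
    "device_id": (0x02, 2, 0, 16),   # standard PCI header field
    "ecc_enable": (0x50, 1, 0, 1),   # HYPOTHETICAL: take from the datasheet
}


def dump_settings(config, fields=FIELDS):
    """Decode every field in the table into a name -> value dictionary."""
    return {name: extract_field(config, *spec) for name, spec in fields.items()}
```

Printing dump_settings(read_config(path)) for the memory controller and comparing each value against what the firmware claims to have set up is one way to arrive at such a list of questions.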
Hakon
>I'm doing the planning for a new cluster and the whole thing is
>remarkably bothersome. You can't easily figure out what motherboards
>will even pretend to do ECC that easily, you can't easily check once
>you have a sample motherboard in hand. It isn't even easy to get ECC
>memory for more modern standards. I'm starting to wonder if doing all
>calculations twice, once on each of two machines, isn't easier, but it
>seems utterly wrong to do that...
>
>Perry
--
Håkon Bugge
CTO
mob. +47 92 48 45 14
off. +47 92 44 81 11
fax. +47 22 23 36 66
Hakon.Bugge at scali.com
Skype: hakon_bugge
Scali - http://www.scali.com
Higher Performance Computing