[Beowulf] ECC settings for Opteron 175 + Serverworks HT1000 chipset

Bruce Allen ballen at gravity.phys.uwm.edu
Fri Jan 27 06:13:26 PST 2006

Salut Velu!

> Hey bruce, World is small isn't it ;))

Yup.  (Actually if you look back through the archives you'll find the 
first announcement of smartmontools was on this list.)

>> [...]
>> I would appreciate advice about:
>>   -- how to configure these settings
>>   -- pointers to relevant AMD/Serverworks documentation
>>   -- relevant Linux kernel options/modules
>>   -- anything else relevant/related

> You can find some documentation on this project: 
> http://bluesmoke.sourceforge.net/ or the older 
> http://www.anime.net/~goemon/linux-ecc/

I've been corresponding off-list with Mark Langsdorf.  He's an AMD 
employee who works on Linux tools and implementation, hangs out on the 
LKML, and submits kernel patches from AMD. Mark said that the 'bluesmoke' 
functionality is only needed with 2.4 kernels.  With 2.6 kernels you just 
install 'mcelog' and that's everything that's needed.
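
As a quick sanity check of that setup, here is a minimal sketch (not from
Mark) that only tests whether the two pieces appear to be present on a
box. It assumes the kernel exposes machine check records through a
character device at /dev/mcelog and that the userspace tool is installed
under the name 'mcelog'; both defaults are assumptions, so adjust them
for your distribution.

```python
import os
import shutil


def mce_reporting_status(dev_path="/dev/mcelog", tool="mcelog"):
    """Report whether the assumed MCE device node and decoder tool exist.

    dev_path and tool are assumed defaults, not verified for every distro.
    """
    return {
        # True if the kernel's machine check device node is present
        "device_present": os.path.exists(dev_path),
        # Full path of the decoder binary, or None if it is not on PATH
        "tool_path": shutil.which(tool),
    }


if __name__ == "__main__":
    status = mce_reporting_status()
    print("MCE device present:", status["device_present"])
    print("mcelog binary:", status["tool_path"] or "not found on PATH")
```

This says nothing about whether ECC errors actually get logged; it only
confirms that the pieces are in place before you start testing.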

Mark also said that the mapping between CPUID and chipid needs to be 
correlated with DIMM slot on a case-by-case basis.  One way (which Mark 
does NOT recommend!) is to heat each DIMM with a heat gun, or mask off a 
single bit on the connector, to generate errors from that DIMM.  This 
makes sense for people on this list who will have dozens or hundreds of 
the same box and want to understand this relationship.
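
Once that relationship is known for one box, it can be written down once
and reused across identical machines. The sketch below shows one possible
way to record it; the table contents and slot labels are hypothetical
placeholders, not measured values for any real motherboard.

```python
# Hypothetical mapping, filled in by hand for one motherboard model.
# Keys are (cpu_id, chip_id) as reported in an error record; values are
# the slot labels silk-screened on the board. All entries are made up.
DIMM_MAP = {
    (0, 0): "DIMM_A1",
    (0, 1): "DIMM_A2",
    (1, 0): "DIMM_B1",
    (1, 1): "DIMM_B2",
}


def dimm_for(cpu_id, chip_id, table=DIMM_MAP):
    """Return the physical slot label for a reported error location."""
    return table.get((cpu_id, chip_id), "unknown (not yet mapped)")


if __name__ == "__main__":
    print(dimm_for(0, 1))
    print(dimm_for(3, 0))
```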

> EDAC sounds to be on the way to be integrated upstream 
> (http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=806c35f5057a64d3061ee4e2b1023bf6f6d328e2). 
> This sounds to be some preliminary work but you may give it a try. I 
> don't know your configuration but the "drivers/edac/amd76x_edac.c" may 
> match. I didn't have time to test EDAC, but if you do, I'm interested in 
> your results.

I'll report back to the list whether mcelog is enough, or whether we also 
needed to install other drivers to get ECC reporting.

Mark also provided advice about the other ECC settings.  I'll copy it 
verbatim to the list.  Mark wrote:

You'll want to look at chapter 3 (Memory System) of the BKDG (AMD 64 BIOS 
AND KERNEL DEVELOPERS GUIDE). Here's the recommended settings:

    ECC enable

    MCA DRAM ECC logging

    ECC Chip Kill
                 Enable if using x4 DIMMs

    DRAM Scrub Redirect

    DRAM BG Scrub
                 set as high as possible (84 ms is maximum)

    L2 Cache BG Scrub
                 not DRAM related

    Data Cache BG Scrub
                 not DRAM related

[Note from Bruce: can anyone on the list make recommendations about these 
last two, non-DRAM-related SCRUB settings?]
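
To get a feel for what a DRAM BG Scrub value means in practice, here is a
rough calculation of how long one full pass over memory would take. It
assumes the setting is the interval between scrubs of successive 64-byte
blocks, which is my reading of the BKDG and should be checked against
your revision. The 40 ns figure is likewise an assumed fastest setting;
only the 84 ms value comes from Mark's note.

```python
BLOCK_BYTES = 64  # assumed scrub granularity (one block per interval)


def full_pass_seconds(dram_bytes, interval_s, block_bytes=BLOCK_BYTES):
    """Time to scrub all of DRAM once at one block per interval."""
    blocks = dram_bytes // block_bytes
    return blocks * interval_s


if __name__ == "__main__":
    four_gib = 4 * 2**30
    # 84 ms is the value quoted above; 40 ns is an assumed fastest setting
    for label, interval in (("84 ms", 84e-3), ("40 ns", 40e-9)):
        secs = full_pass_seconds(four_gib, interval)
        print(f"{label}: {secs:.1f} s ({secs / 86400:.2f} days)")
```

Under those assumptions, a box with 4 GiB scrubbed at the 84 ms interval
would need roughly 65 days per pass, so the choice of value matters a lot
if the goal is to catch single-bit errors before they accumulate.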

I also asked Mark:

> Am I correct that there is nothing in the Linux kernel which
> modifies the  machine registers which determine ECC behavior,
> so I have to depend upon the BIOS to initialize/configure
> these registers as I want?

He replied:

As far as I know, it's BIOS set-up only.  Linux tries to avoid
knowing the details of the DRAM set-up, and there's a limit to
how much the OS can modify anyway.  Linux can set bits to
determine what MCEs cause exceptions, but it can't enable the
DRAM scrubber, for example.
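
Even if the BIOS is the only thing that sets these registers, it should
still be possible to read back what it programmed. The sketch below is
read-only and rests on several assumptions that are not from Mark: that
the scrub control register sits at offset 0x58 in function 3 of the
node 0 northbridge (PCI device 00:18.3), that the low five bits hold the
DRAM scrub rate, and that the rate starts at 40 ns and doubles with each
step. Check all of this against the BKDG before trusting the output.
Reading beyond the first 64 bytes of config space normally needs root.

```python
import struct

BASE_INTERVAL_S = 40e-9  # assumed interval for field value 1
MAX_FIELD = 0x16         # assumed largest defined value (about 84 ms)


def dram_scrub_interval(scrub_ctl):
    """Decode the assumed DRAM scrub field (bits 4:0) into seconds.

    Returns None if the scrubber is disabled (field == 0).
    """
    field = scrub_ctl & 0x1F
    if field == 0:
        return None
    if field > MAX_FIELD:
        raise ValueError(f"reserved scrub rate value {field:#x}")
    return BASE_INTERVAL_S * 2 ** (field - 1)


def read_scrub_ctl(path="/sys/bus/pci/devices/0000:00:18.3/config",
                   offset=0x58):
    """Read one little-endian 32-bit register from sysfs config space.

    The default path and offset are assumptions for node 0.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(4)
    if len(raw) != 4:
        raise OSError("short read; try running as root")
    return struct.unpack("<I", raw)[0]


if __name__ == "__main__":
    try:
        interval = dram_scrub_interval(read_scrub_ctl())
    except (OSError, ValueError) as exc:
        print("could not read scrub setting:", exc)
    else:
        if interval is None:
            print("DRAM scrubber disabled")
        else:
            print(f"DRAM scrub interval: {interval:.3e} s")
```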

