Take any two: motherboard performance, compatibility, value

Don Holmgren djholm at fnal.gov
Wed Jun 28 17:06:33 PDT 2000


On Wed, 28 Jun 2000, Bob Drzyzgula wrote:

...

> > > BTW, I see that ECC corrects about one single bit error per month in
> > > 12GB of RAM.  Our total system will have close to 40GB, so errors could
> > > pop up weekly, which is why we need ECC.  
> > 
> > Are you absolutely certain that ECC RAM on PC hardware actually *corrects*
> > bit errors ?
> > 
> > There was a short discussion on this subject on the linux-kernel list some
> > weeks ago, where someone stated that ECC RAM (for PCs) can only *detect* a
> > parity error and offer you an NMI when that occurs. No one seemed to object to
> > this.
> 
> The last thing I am is an expert on this, but, quoting
> Intel's 440BX web page at
> 
>   http://developer.intel.com/design/intarch/techinfo/440BX/BX_arch.htm
> 
> ] The Intel® 440BX AGPset also provides DIMM plug-and-play
> ] support via Serial Presence Detect (SPD) mechanism using
> ] the SMBus interface. The 82443BX provides optional
> ] data integrity features including ECC in the memory
> ] array. During reads from DRAM, the 82443BX provides
> ] error checking and correction of the data. The 82443BX
> ] supports multiple-bit error detection and single-bit error
> ] correction when ECC mode is enabled and single/multi-bit
> ] error detection when correction is disabled. During
> ] writes to the DRAM, the 82443BX generates ECC for the
> ] data on a QWord basis. Partial QWord writes require a
> ] read-modify-write cycle when ECC is enabled.
> 
> In these PC architectures, I don't think that there is any
> ECC generation on-module like there is in some architectures,
> there is only sufficient bit storage to allow the chipset
> to generate the somewhat-redundant codes and store those.
> 
> Whether the motherboard manufacturers, BIOS writers and
> operating systems configure the chipset properly to take
> advantage of this, or do anything interesting with any
> information provided by the chipset is another matter
> entirely. I would expect, for example, that the chipset
> would raise some sort of alert if a single-bit ECC error
> was detected and corrected; certainly the OS would want
> to log such an event. Depending on the motherboard, BIOS
> and OS, it would certainly be possible to treat such an
> alert exactly the same as one would treat a double-bit
> error, or a single-bit error when ECC is turned off,
> e.g. NMI. It's also possible, I suppose, that the ECC
> generation and detection in the 443BX doesn't work worth
> a damn and thus most 440BX designs leave it turned off.
> I have no reason to believe this is true, however.
> 
> FWIW.
> 
> --Bob Drzyzgula

When we ran into some memory problems on 440BX- and 440GX-based systems, I dug
into the Intel PCI chipset manuals and wrote some code to dump the information
from the memory controller registers. 

The extra 8 bits available on memory with parity (72 bits wide, rather than 64
bits) are indeed used by the memory controller to do the ECC calculations and
corrections.  Interestingly, this is now marketed as ECC memory; a couple of
years ago it was sold as parity memory.  No additional circuitry is needed on
the DIMMs.  Single bit errors are all corrected transparently to the
microprocessor.  Multibit errors are not correctable, and if so configured the
chipset can issue an NMI.  On Linux this NMI results in the "dazed and
confused" console message:

  "Uhhuh. NMI received. Dazed and confused, but trying to continue"

We have a critical application which can't tolerate data errors, and so we have
patched the NMI trap to reboot the system immediately following a multibit
error.
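
To illustrate how 8 check bits on a 64-bit word can correct any single bit
error and detect any double bit error, here is a minimal sketch in C of an
extended Hamming code over 72 bits.  This is a textbook construction chosen for
illustration only: the bit layout and the names secded_encode and secded_decode
are assumptions, not taken from the Intel manuals, and the code actually used
by the 82443BX may well differ.

```c
#include <stdint.h>

#define NBITS 72   /* 64 data bits + 8 check bits, as on a 72-bit wide DIMM */

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_UNCORRECTABLE };

/* Assumed layout (illustrative): cw[0] is an overall parity bit, the
   power-of-two positions 1,2,4,...,64 hold Hamming check bits, and the
   remaining 64 positions hold the data bits, one bit per array element. */
void secded_encode(uint64_t data, unsigned char cw[NBITS])
{
    int pos, d = 0, syndrome = 0, parity = 0;

    for (pos = 0; pos < NBITS; pos++)
        cw[pos] = 0;
    for (pos = 1; pos < NBITS; pos++) {
        if ((pos & (pos - 1)) == 0)
            continue;                      /* power of two: check bit slot */
        cw[pos] = (unsigned char)((data >> d++) & 1);
        if (cw[pos])
            syndrome ^= pos;               /* XOR of positions of set bits */
    }
    for (pos = 1; pos < NBITS; pos <<= 1)  /* force the total syndrome to 0 */
        cw[pos] = (syndrome & pos) ? 1 : 0;
    for (pos = 1; pos < NBITS; pos++)
        parity ^= cw[pos];
    cw[0] = (unsigned char)parity;         /* even parity over all 72 bits */
}

/* Checks the word, repairs a single flipped bit in place, and extracts
   the data.  *data is only written when the word is usable. */
enum secded_status secded_decode(unsigned char cw[NBITS], uint64_t *data)
{
    int pos, d = 0, syndrome = 0, parity = 0;
    enum secded_status st = SECDED_OK;

    for (pos = 0; pos < NBITS; pos++) {
        parity ^= cw[pos];
        if (cw[pos])
            syndrome ^= pos;
    }
    if (parity) {                          /* odd number of flips */
        if (syndrome >= NBITS)
            return SECDED_UNCORRECTABLE;   /* not a valid bit position */
        cw[syndrome] ^= 1;                 /* syndrome 0 is the parity bit */
        st = SECDED_CORRECTED;
    } else if (syndrome != 0) {
        return SECDED_UNCORRECTABLE;       /* even number of flips, >= 2 */
    }
    *data = 0;
    for (pos = 1; pos < NBITS; pos++) {
        if ((pos & (pos - 1)) == 0)
            continue;
        *data |= (uint64_t)cw[pos] << d++;
    }
    return st;
}
```

A caller would encode a word, flip one element of cw to simulate a memory
error, and expect SECDED_CORRECTED with the original data returned; flipping
two elements should yield SECDED_UNCORRECTABLE, which corresponds to the case
where the chipset can raise an NMI.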

The memory controller has a couple of registers used to indicate whether bit
errors have been detected - a flag for a single bit error, a flag for a multiple
bit error, and the page where the error occurred.  This information is latched
at the first error.
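
As a rough illustration of what "latched at the first error" means for anyone
reading these registers, the sketch below models the behavior in plain C.  The
register layout used here (flag bits 0 and 1, page number in the upper bits),
the 4 KB page size, and all of the names are hypothetical; the real offsets and
bit positions have to be taken from the Intel host bridge manual.

```c
#include <stdint.h>

/* HYPOTHETICAL layout, for illustration only.  It is not copied from the
   82443BX/82443GX manual. */
#define ERR_SINGLE_BIT  0x00000001u  /* assumed: single bit error flag   */
#define ERR_MULTI_BIT   0x00000002u  /* assumed: multiple bit error flag */
#define ERR_PAGE_SHIFT  12           /* assumed: 4 KB pages              */

struct ecc_report {
    int      single_bit;  /* nonzero if the single bit flag is set   */
    int      multi_bit;   /* nonzero if the multiple bit flag is set */
    uint32_t page;        /* page recorded for the first error       */
};

/* Split a raw register value into its fields. */
struct ecc_report ecc_decode(uint32_t reg)
{
    struct ecc_report r;

    r.single_bit = (reg & ERR_SINGLE_BIT) != 0;
    r.multi_bit  = (reg & ERR_MULTI_BIT) != 0;
    r.page       = reg >> ERR_PAGE_SHIFT;
    return r;
}

/* Simplified software model of the latch: the first error is recorded,
   and later errors leave the register unchanged until it is reset. */
uint32_t ecc_latch(uint32_t reg, uint32_t page, uint32_t flag)
{
    if (reg & (ERR_SINGLE_BIT | ERR_MULTI_BIT))
        return reg;                       /* already latched */
    return (page << ERR_PAGE_SHIFT) | flag;
}
```

In this model, a single bit error on one page followed by a multiple bit error
on another still reports only the first event, which is why a clean-looking
report after the first error does not rule out later ones.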

At ftp://linux-rep.fnal.gov/pub/motherboards/ I have 3 programs you can use to
query the controller:
  chip2.c - dumps lots of information, such as CAS/RAS timings, which DIMM
            slot(s) are populated, how large the DIMMs are, whether each DIMM is
            ECC-capable or not, whether and where bit errors have occurred.
  biterror_check.c - checks and reports whether or not a single or multiple bit
            error has occurred, and the page of the occurrence.  Remember, this
            information is latched, so additional errors may have occurred
            subsequent to the first.
  biterror_reset.c - checks and reports whether or not a single or multiple bit
            error has occurred, and the page of the occurrence.  Also resets the
            error flags.

On my motherboards there's always a single bit error after a reboot, so I
suspect the BIOS causes one to happen when sizing memory.  So, I usually do a
biterror_reset during system startup.  

On the systems we're currently monitoring - 20 L440GX+ motherboards with 512 MB
of memory each - single bit errors are extremely rare.  Perhaps 1 per month of
operation across all of the machines.  I've not seen a multiple bit error since
replacing memory last January.
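
For a sense of scale: 20 boards with 512 MB each is 10 GB, so this is roughly
one single bit error per month per 10 GB.  The small sketch below scales that
to the roughly 40 GB system mentioned earlier in the thread.  Linear scaling
with memory size and the 30-day month are assumptions made for the estimate,
and the function names are made up for this example.

```c
/* Expected errors per month for target_gb of RAM, scaled linearly from a
   rate observed on observed_gb of RAM.  Linear scaling is an assumption. */
double scaled_errors_per_month(double observed_per_month,
                               double observed_gb, double target_gb)
{
    return observed_per_month * target_gb / observed_gb;
}

/* Average days between errors, assuming a 30-day month. */
double days_between_errors(double errors_per_month)
{
    return 30.0 / errors_per_month;
}
```

With the figures above, scaled_errors_per_month(1.0, 10.0, 40.0) gives about 4
errors per month, or one roughly every 7.5 days, which is in line with the
"weekly" estimate quoted at the top of this message.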

To interpret the output of chip2.c you'll need the 82443BX or 82443GX host
bridge manual from Intel.

Don Holmgren
Fermilab




