A few bits from my corner of the experience space:<div><br></div><div>If you have a BMC, 'ipmitool sel list' will probably show the correctable and uncorrectable errors, generally not naming the DIMM involved. But 'ipmitool sel list -v' shows details from various fields in the SEL records. In the ASUS boards I've been playing with lately, the Sensor Number field together with the Event Data field will (usually) tell you the DIMM slot, once you know how to decode those fields for the specific motherboard (and possibly firmware revisions?) that you have.</div>
<div><br></div><div>How do you get that motherboard-specific data? By finding a DIMM that reliably produces errors, and moving it from slot to slot, taking notes on those two SEL fields above. I've seen a similar thing work for Dell machines too.</div>
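<div><br></div><div>A minimal sketch of how those slot-to-slot notes could be put to use, in Python. It assumes the verbose output is a series of "field : value" lines with each record starting at a "SEL Record ID" line, and that memory events carry "Memory" in the Sensor Type field. Check both against what your own BMC prints. The SLOT_TABLE entries, the slot names, and the SAMPLE text are made up for illustration; the real table is whatever you recorded for your board.</div>

```python
from collections import Counter

# Hypothetical notes from moving a known-bad DIMM from slot to slot:
# (Sensor Number, Event Data) -> silkscreened slot name. Values are made up.
SLOT_TABLE = {
    ("08", "a00001"): "DIMM_A1",
    ("08", "a00002"): "DIMM_A2",
}

# Illustrative text in the assumed shape of 'ipmitool sel list -v' output.
SAMPLE = """\
SEL Record ID          : 0001
 Timestamp             : 01/01/2010 00:00:00
 Sensor Type           : Memory
 Sensor Number         : 08
 Event Data            : a00001

SEL Record ID          : 0002
 Timestamp             : 01/01/2010 00:05:00
 Sensor Type           : Memory
 Sensor Number         : 09
 Event Data            : a00001

SEL Record ID          : 0003
 Timestamp             : 01/01/2010 00:06:00
 Sensor Type           : Power Supply
 Sensor Number         : 20
 Event Data            : 01ffff
"""


def parse_sel_verbose(text):
    """Return one dict of field -> value per SEL record."""
    records = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "SEL Record ID":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def count_by_slot(records, table):
    """Count memory events per slot, labelling pairs not yet in the table."""
    counts = Counter()
    for rec in records:
        if rec.get("Sensor Type") != "Memory":
            continue
        key = (rec.get("Sensor Number"), rec.get("Event Data"))
        counts[table.get(key, "unknown %s/%s" % key)] += 1
    return counts


if __name__ == "__main__":
    for slot, n in count_by_slot(parse_sel_verbose(SAMPLE), SLOT_TABLE).items():
        print(slot, n)
```

<div>To run it against a real log, save the output of 'ipmitool sel list -v' to a file and pass its contents to parse_sel_verbose() in place of SAMPLE. Any pair reported as "unknown" is one you haven't mapped to a slot yet.</div>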
<div><br></div><div>If you have Dell PowerEdge R or M boxes (or previous generation equivalents), there are various nicer ways to get the name of the DIMM involved, including using a version of ipmitool that has the 'delloem' subcommand.</div>
<div><br></div><div>I second Tony's suggestion that RAM testers may not be as good as real systems, for finding bad RAM. My experience on one large system a few years ago was that new DIMMs failed at a rate of around 1% per year, but "refurbished" DIMMs from RMAs failed at 10% per year (or was it even higher? I forget). I was led to believe that these refurbished DIMMs were often customer returns that had been run through a RAM tester and passed. Turns out sometimes the customers were right and the "refurbishment" process was wrong.</div>
<div><br></div><div>One more thing about the ASUS boards I've been playing with lately: If you get a panic on an uncorrectable memory error and power cycle the system (using the power button, or by remote 'ipmitool ... power cycle'), the following POST does not report the bad DIMM. But if you *reset* the system (by pushing the reset button with a paperclip, or by remote 'ipmitool ... power reset'), the next POST will pause and tell you which CPU, Channel, and DIMM were affected by that previous uncorrectable error, which is more info than 'ipmitool sel list' gives you. It's then up to you to figure out how CPU, Channel, and DIMM map to the silkscreened names on the motherboard -- I couldn't find documentation, but it turned out to be the pattern we suspected. :)</div>
<div><br></div><div>David<br><br></div>