Prentice,<br><br>You only asked for memory testing programs, but I'm going to go a bit further, to make sure some background issues are covered, and to give you some ideas you might not yet have. Some of this is based on a lot of experience with Dell servers in HPC.<br>
<br><br>Some of my background thoughts on dealing with SBEs:<br><br>1) Complete and detailed historical records are important for correctly and efficiently resolving these types of errors, especially on larger clusters. Otherwise it's too easy to get confused about what happened when, and to come to incorrect conclusions about problems and solutions. Treat it like a lab experiment -- keep a log book or equivalent, test your hypotheses against the data, and think broadly about what alternative hypotheses may exist.<br>
<br>2) The resolution process will be iterative, with physical manipulations (e.g. moving DIMMs among slots) alternating with monitoring for SBEs and optionally running stress applications to attempt to trigger SBEs (a "reproducer" of the SBEs).<br>
<br>3) For efficient resolution, you want a quick, reliable reproducer, something that will trigger the SBEs quickly.<br><br>4) I've seen no evidence that SBEs materially affect performance or correctness on a server, so my practice has often been to leave affected servers in production as much as possible, taking them out of production (after draining jobs) only briefly to move DIMMs, replace DIMMs, etc.<br>
<br>Regarding (4), if anyone here has measurements or a URL to a study saying in what circumstances there's a significant material risk to performance or correctness of calculation with SBE correction, I'd love to see that. I'm not saying that SBE correction is completely free performance-wise -- I expect each correction takes a little time, but for normal SBE correction rates I bet that time is (nearly) unmeasurable.<br>
<br>Also, over a few thousand server-years, I've never or almost never seen SBE corrections morph into uncorrectable multi-bit errors. When uncorrectable errors have shown up (which itself has been rare in my experience, mostly in a single situation where there was a server bug that got corrected), they've shown up early on a server, not after a long period of only seeing SBEs.<br>
<br><br>Prentice, I believe you started this thread because you need something for (3), is that right? As David Mathog said, you already know what activity most reliably triggers SBE corrections: your users' code. If I were in your shoes, and I had time and/or were concerned about issue (4) above, I'd a) identify which user, code, and specific runs trigger SBEs the most, then b) if possible, work with that user to get a single-node version of a similar run that you could use outside of production, to reproduce and resolve SBEs. I'd then monitor for SBEs in production, and when they occur, drain jobs from those nodes and take them out of production so I could use that single-node user job to satisfy (2) and (3) above.<br>
<br>If I were in your shoes and was NOT concerned about (4), I'd simply drain the running job, do a manipulation (2 above), and put the node back into production, waiting for the SBE to recur if it's going to. This is what I've often done.<br>
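<br>What "drain" looks like depends on your scheduler, of course. From memory (the node name is a placeholder; check the syntax against your scheduler version), it's something like:<br><br>  pbsnodes -o node042    # Torque/PBS: mark the node offline; running jobs finish, nothing new lands<br>  scontrol update NodeName=node042 State=DRAIN Reason="SBE"    # the Slurm equivalent<br>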
<br>Or if you have a dev/test cluster, replace the entire production node with a tested, known-good node from the dev/test cluster, then test/fix the SBE server in the context of the dev/test cluster. I've also often done this.<br>
<br>My experience has been that long runs of single-node HPL were the best SBE trigger I ever found. Dell's mpmemory did not do as well. I believe memtest86{,+} also didn't find problems that HPL found, though I didn't test memtest86{,+} as much. It also was not immediately obvious how to gather the memory test results from mpmemory and memtest86{,+}, though it can probably be done, perhaps easily, with a bit of R&D.<br>
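<br>If you do want to try single-node HPL again anyway, the sizing I'd start from (the numbers below are only illustrative, for a 48 GB, 8-core node -- plug in your own memory size and core count) is roughly:<br><br>  # N ~ sqrt(0.80 * memory_in_bytes / 8 bytes per double), e.g. sqrt(0.80 * 48e9 / 8) is about 69000<br>  # in HPL.dat: Ns around 69000, NBs around 192, P x Q = core count (e.g. 2 x 4)<br>  mpirun -np 8 ./xhpl<br><br>The idea is simply to cover as much of the memory as possible, for as long as possible.<br>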
<br>But since you've found that HPL does not trigger SBEs as much as your user's code, I think you have a very good pointer that you should do stress tests with your user's code if at all possible.<br><br>If you can share what the stressful app is, and any of the relevant run parameters, that would probably be interesting to folks on this list.<br>
<br>In my experience, SBEs are usually resolved by reseating or replacing the affected DIMM. However, it can also be an issue on the motherboard (sockets or traces or something else), or possibly the CPU (because Intel and AMD now both have the memory controllers on-die), or possibly a BIOS issue (if a CPU- or memory-related parameter isn't set quite optimally by the BIOS you're running; BIOSes may set hardware parameters without your awareness or any ability to tune them yourself).<br>
<br><br>Best practice may be:<br><br>A) swap the DIMM where the SBE occurred with a neighbor that underwent similar stress but did not show any SBEs. Keep a permanent record of which DIMMs you swapped and when, as well as all error messages and their timing (a sample log format follows this list).<br>
B) re-stress either in production (if you believe my tentative assertion (in 4 above) that SBE corrections do not materially affect performance or correctness), or using your reliable reproducer for an amount of time that you know should usually re-trigger the SBE if it is going to recur.<br>
C) assess the results and respond accordingly:<br> 1) if the SBE messages do not recur, then either reseating resolved it, or it's so marginal that you will need to wait longer for it to show up; may as well leave it in production in this case<br>
2) if the SBE messages follow the DIMM when you swapped it with its neighbor, then it's very very likely the DIMM (especially if the SBE occurred quickly upon stressing it, both before and after the DIMM move). Present this evidence to Dell Support and ask them to send you a replacement DIMM. KEEP IN MIND that although the replacement DIMM will usually resolve the issue, it has never before been stressed in your setup, and it's possible for your stress patterns to elicit SBEs even in this replacement DIMM. So if the error recurs in that DIMM slot, it's possible that the replacement DIMM also needs to be replaced. You again need to do a neighbor swap to check whether it really is the replacement DIMM.<br>
3) If the SBE stays with the slot after you did the neighbor swap, take this evidence to Dell Support, and see what they say. I would guess they'd have the motherboard and/or CPU swapped. Alternatively, you may wish (use your best judgment) to gather more data by CAREFULLY! swapping CPUs 1 and 2 in that server and see whether the SBEs follow the CPU or stay with the slot. Just as with DIMMs, it's not unheard of for replacement motherboards and CPUs to also have issues, so don't assume they're perfect -- usually the suitable replacement will resolve the issue fully, but you won't know for sure until you've stressed the system.<br>
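<br>Regarding the record-keeping in (A): the format doesn't matter much, as long as it's complete and you keep it forever. One line per event is enough; the entries below are made up, just to show the shape:<br><br>  2010-12-09  node042  SEL: SBE warning threshold on DIMM B3 (03:12); drained node<br>  2010-12-09  node042  swapped DIMMs B3 and B4, reseated both; back in production<br>  2010-12-14  node042  SBE recurred, now reported on B4 -- follows the DIMM; opened Dell ticket<br>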
<br><br>What model of PowerEdge are these servers?<br><br>PowerEdge systems keep a history of the messages that get printed on the LCD in the System Event Log (SEL), in older days also called the ESM log (embedded systems management, I believe). The SEL is maintained by the BMC or iDRAC. I believe the message you report below (SBE logging disabled) will be in the SEL. I know the SEL logs messages that indicate that the SBE correction rate has exceeded two successive thresholds (warning and critical).<br>
<br>You can read and clear the SEL using a number of different methods. I'm sure DSET does it. You can also do it with ipmitool, omreport (part of the OpenManage Server Administration (OMSA) tools for the Linux command line), and during POST by hitting Ctrl-E (I think) to get into the BMC or iDRAC POST utility. I'm sure there are other ways; these are the ones I've found useful.<br>
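<br>Whichever tool you use, save a copy of the SEL before you clear it -- it becomes part of that historical record I keep harping on. With plain ipmitool that's just:<br><br>  ipmitool sel list > node042-sel-20101209.txt    # the file name is only my habit<br>  ipmitool sel clear<br>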
<br><br>Normal, non-Dell-specific ipmitool will print the SEL records using 'ipmitool sel list', but it does not have the lookup tables and algorithms needed to tell you the name of the affected DIMM on Dell servers. You can also do 'ipmitool sel list -v', which will dump raw field values for each SEL record, and you can decode those raw values to figure out the affected DIMM -- with enough examples (and comparing e.g. to the DIMM names in the Ctrl-E POST SEL view), you might be able to figure out the decoding algorithm on your own, or google might give you someone who has already figured out the decoding for your specific PowerEdge model.<br>
<br>That is the downside of using standard ipmitool. The upside of ipmitool, though, is that it's quite lightweight, and can be used both on localhost and across the network (using IPMI over LAN, if you have it configured appropriately).<br>
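<br>For reference, the local and over-the-LAN invocations look like this (the hostname and user are placeholders; use -I lan instead of lanplus if the BMC only speaks IPMI 1.5):<br><br>  ipmitool sel list    # on the node itself, via the OS IPMI driver<br>  ipmitool -I lanplus -H node042-bmc -U root -a sel list    # remote; -a prompts for the BMC password<br>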
<br><br>The good news is that there's a Dell-specific version of ipmitool available, which adds some Dell-specific capabilities, including the ability to decode DIMM names. This works at least for current PowerEdge R and M servers, as well as older PowerEdge models like the 1950, and probably a few generations older than that. I think it simply supports all models that the corresponding version of OpenManage supports; this does not include older SC servers or current C servers. If you have a model that OpenManage does not support, it may still be worth trying, in case it does the right thing for you.<br>
<br>You can get the 'delloem' version of ipmitool from the OpenManage Management Station package. The current latest URL is <br><br><a href="ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz">ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz</a><br>
<br>Then unpack it and look in ./linux/bmc/ipmitool/ for your OS or a compatible one.<br><br>For example, looking in the RHEL5_x86_64 subdirectory, the rpm OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm has /usr/bin/ipmitool with 'delloem' appearing as a string internally. (I'm not able to test it right now.)<br>
<br>Once you've installed the appropriate package, do 'ipmitool delloem'; this should tell you what the secondary options are. I believe 'ipmitool delloem sel' will decode the SEL including the correct DIMM names.<br>
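<br>Roughly, I'd expect the whole sequence to look like this (RHEL5 x86_64 shown, per the rpm above; I haven't run this exact sequence myself, so treat it as a sketch):<br><br>  tar xzf OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz<br>  cd linux/bmc/ipmitool/RHEL5_x86_64    # or whichever subdirectory matches your OS<br>  rpm -Uvh OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm<br>  ipmitool delloem    # lists the Dell-specific subcommands<br>  ipmitool delloem sel    # should print the SEL with the real DIMM names<br>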
<br><br>If you install OpenManage appropriately, you can also get the SEL decoded, as well as get alerts automatically and immediately sent to syslog. The command line to print a decoded SEL is 'omreport system esmlog'. OpenManage is pretty heavy-weight, though. Some people do install it and leave it running on HPC compute nodes; some people would never do that on a production node.<br>
<br>Your mention of getting log messages about the SBEs makes me think you do have OMSA installed and its daemons running -- is that correct? Try 'omreport system esmlog' if so.<br><br><br>Finally, hitting Ctrl-E at the prompted moment during POST will get you into the BMC or iDRAC POST menu system, in which you can view and optionally clear the SEL. I do not think this is easily scriptable, but if all else fails, that is one way to view the SEL, with proper decoding.<br>
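<br>The OMSA commands I'd reach for (paths and subcommands are from memory of OMSA 6.x, so double-check against your version):<br><br>  /opt/dell/srvadmin/sbin/srvadmin-services.sh status    # are the OMSA daemons actually running?<br>  omreport system esmlog    # the decoded SEL / ESM log<br>  omreport system alertlog    # OMSA's own alert log, sometimes also useful<br>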
<br><br>I know that's long, and I hope that helps you and possibly others.<br><br>David<br><br><div class="gmail_quote">On Thu, Dec 9, 2010 at 1:54 PM, Prentice Bisbal <span dir="ltr"><<a href="mailto:prentice@ias.edu">prentice@ias.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">Jon Forrest wrote:<br>
> On 12/9/2010 8:08 AM, Prentice Bisbal wrote:<br>
><br>
>> So far, mprime appears to be working. I was able to trigger an SBE in 21<br>
>> hours the first time I ran it. I plan on running it repeatedly for the<br>
>> next few days to see how well it can repeat finding errors.<br>
><br>
> After it finds an error how do you<br>
> figure out which memory module to<br>
> replace?<br>
><br>
<br>
</div>The LCD display on the front of the server tells me, with a message like<br>
this:<br>
<br>
"SBE logging disabled on DIMM C3. Reseat DIMM"<br>
<br>
I can also generate a report with DELL DSET that shows me a similar<br>
message. I'm sure there are other tools, but I usually have to<br>
create a DSET report to send to Dell, anyway.<br>
<br>
--<br>
<font color="#888888">Prentice<br>
</font></blockquote></div><br>