[Beowulf] ECC exerciser/exorciser?

Prentice Bisbal prentice at ias.edu
Mon Jan 26 08:00:54 PST 2009


Mark Hahn wrote:
> Hi all,
> we're having some trouble with nodes showing high ECC corrected error (CE)
> counts.  I'm wondering whether you have any wisdom on the following:
> 
> - first, how would you go about setting a threshold for how high is an
> acceptable CE count?  we by default are using the mce module, which by
> default polls at 1Hz.  my thinking is that if we get overflow events
> (the multiple error bit is set), then it's too fast.
> 
> - do you have or know of a good exerciser for testing ECC's?  yes, I
> know about memtest86, but I'm more curious about a load that could be
> run under linux.  my thinking is that ecc's are triggered by bad reads,
> so something which allocates all memory and then continually reads it
> would be best.
> 

Mark,

I find that just running a large HPL job across the cluster will find
errors. It may take a couple of days, but it will. I've run breakin for
days on end without finding any memory errors, but when I run a
full-blown HPL job, I find memory errors right away (if right away = a
couple of days).

Breakin runs xhpl on every core, but I'm not sure whether it's MPI-based
or whether every core is running an independent job. Maybe the breakin
developer(s) can chime in on how it stresses the RAM.
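
For what it's worth, the "allocate memory and keep reading it" load you
describe could look roughly like the sketch below. This is only an
illustration, not something I've run as a burn-in tool: the function
names (make_buffer, read_pass, exercise), the 0xAA fill pattern, and the
sizes are all made up for the example. It does not lock pages or grab
"all" memory, and the buffer would need to be much larger than the CPU
caches for the reads to actually reach DRAM. Corrected errors would not
show up as mismatches here (the hardware fixes them before software sees
the data), so you would still watch the CE counters separately while it
runs.

```python
def make_buffer(size_bytes, pattern=0xAA):
    """Allocate size_bytes of memory and fill every byte with pattern."""
    return bytearray([pattern]) * size_bytes


def read_pass(buf, pattern=0xAA):
    """Read the whole buffer once; return how many bytes differ from pattern."""
    # count() has to scan every byte, which is the read traffic we want.
    return len(buf) - buf.count(bytes([pattern]))


def exercise(size_bytes, passes, pattern=0xAA):
    """Fill a buffer once, then re-read it `passes` times.

    Returns the total number of mismatching bytes seen across all passes.
    A nonzero result would mean data changed underneath us, i.e. something
    ECC did not correct.
    """
    buf = make_buffer(size_bytes, pattern)
    mismatches = 0
    for _ in range(passes):
        mismatches += read_pass(buf, pattern)
    return mismatches


if __name__ == "__main__":
    # Hypothetical sizes: 256 MiB, 10 passes. Scale up toward free RAM.
    print(exercise(256 * 1024 * 1024, 10))
```

One copy per core, each with its own buffer, would be closer to what
you're asking for than a single process.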

Hope that helps.

-- 
Prentice


