Burn-in Utilities

Velocet math at velocet.ca
Wed Apr 24 08:30:08 PDT 2002


On Wed, Apr 24, 2002 at 10:45:19AM -0400, Justin Nemmers's all...
> All:
> 	I am in search of a utility that will allow me to burn-in a 
> new PC.  Ideally, it would peg the procs at 100% as well as exercise 
> the memory (as much as 2GB/node).  I know there is a Sun provided 
> utility to do this on Sparc systems, but does anyone have a 
> suggestion for a linux-based tool (perl would work, too) that will do the 
> same thing?

The packages cpuburn and memtest (in Debian and Red Hat, AFAIK) will do
you nicely.

We run five or so copies each of burnMMX, burnK7 and memtest on our Athlon
machines for 2-3 days and see if even one crashes. We've had crashes on
machines tested AFTER they had been in service with no problems for 3-4
months, so it's definitely a hardcore exercise.  Oh, we also stick dnetc on
them on top of all that just to make sure it's hurting.  I think the burn
programs are written to generate the most heat possible in the CPU during
operation. They definitely draw the most current - when we were first
setting up our cluster and weren't sure of the power draw, 8 dual 1.333GHz
Athlon boards (no drives) would run G98 fine on a 15 amp circuit - as soon
as we ran burnMMX/burnK7 we'd blow breakers.
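
To put a rough number on that breaker story, here is a small sketch of the arithmetic. It assumes 120 V nominal mains, which the post does not state, and it ignores breaker derating and trip curves, so treat the result as a ballpark only. The function name per_node_watts is made up for illustration.

```shell
#!/bin/sh
# Rough per-node power ceiling implied by "8 nodes ran fine on one 15 A circuit".
# ASSUMPTION: 120 V nominal mains (not stated in the post); derating is ignored.

# per_node_watts BREAKER_AMPS NODES VOLTS -> whole watts available per node
per_node_watts() {
    pw_amps=$1
    pw_nodes=$2
    pw_volts=$3
    # Integer arithmetic: total circuit watts divided evenly across the nodes.
    echo $(( pw_amps * pw_volts / pw_nodes ))
}

per_node_watts 15 8 120   # prints 225
```

Under that assumption the G98 load stayed below roughly 225 W per dual node on average, and the burnMMX/burnK7 load went above it.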

We run 5-10 copies to get a nice high context-switch rate going and exercise
the OS as well ;) We found (through trial and error) that a machine running
only one each of burnMMX/burnK7 at a time will often not crash for days,
whereas one running 5-10 will.
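
A minimal sketch of how several copies might be started at once, for anyone who wants to script it. This is not the setup described above, just one way to do it. It assumes each burn command runs until killed and only returns when something goes wrong, so check that against the documentation of the tools you actually install. The function names start_copies and early_exits and the log file are invented for this example.

```shell
#!/bin/sh
# Sketch: run several copies of a burn command and record any copy that exits.
# ASSUMPTION: the command normally runs until killed, so any exit is suspicious.

# start_copies LOG COUNT CMD [ARGS...]
#   Start COUNT background copies of CMD. If a copy ever returns, a line with
#   the command name and its exit status is appended to LOG.
start_copies() {
    sc_log=$1
    sc_count=$2
    shift 2
    sc_i=0
    while [ "$sc_i" -lt "$sc_count" ]; do
        ( "$@" >/dev/null 2>&1; echo "$1 exited with status $?" >>"$sc_log" ) &
        sc_i=$((sc_i + 1))
    done
}

# early_exits LOG: print how many copies have exited so far.
early_exits() {
    if [ -f "$1" ]; then
        wc -l <"$1" | tr -d ' '
    else
        echo 0
    fi
}
```

For example, calling start_copies with a log path, a count of 5 and burnMMX, then again with burnK7, starts ten burners; checking early_exits on the same log after 2-3 days shows whether any of them gave up. A machine that locks up hard will of course never write to the log, so a dead box is still the main signal.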

(In fact, if a board is slated for a workstation running Windows, we only
consider a crash within 12 hours to be a reason to RMA it.  Surviving 12
hours of that test is almost equivalent to one crash every 3-6 months of
regular Linux desktop use - and with Windows how can you tell? :) )
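
The rule in that parenthetical could be written down roughly as follows. This is only a sketch: the 12-hour workstation cutoff comes from the text, but the role names, the function name rma_decision, and the 72-hour cluster window (the upper end of "2-3 days") are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the keep-or-RMA rule described above.
# ASSUMPTIONS: role names and the 72-hour cluster window are illustrative;
# only the 12-hour workstation cutoff is taken from the post.

# rma_decision ROLE HOURS_TO_CRASH
#   ROLE: "cluster" or "workstation"
#   HOURS_TO_CRASH: whole hours until the first crash, or "none" if it survived
rma_decision() {
    rd_role=$1
    rd_hours=$2
    case $rd_role in
        cluster)     rd_limit=72 ;;
        workstation) rd_limit=12 ;;
        *) echo "unknown role: $rd_role" >&2; return 2 ;;
    esac
    # A board is kept if it never crashed or only crashed after the window.
    if [ "$rd_hours" = "none" ] || [ "$rd_hours" -ge "$rd_limit" ]; then
        echo keep
    else
        echo RMA
    fi
}

rma_decision workstation 30   # crashed after 30 hours: prints keep
rma_decision cluster 30       # same board fails the cluster window: prints RMA
```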

It's actually surprising how well you can measure the quality of boards that
way. Out of 40 246x Tyan boards we found one bad stick of RAM and no bad CPUs
or boards using this method. However, with ECS K75As we found that 1 in 10
boards as shipped to us would die in 1-6 hours under this load, and another
1 in 10 would die within the 2-3 days. while ! burnMMX; do RMA_via_VAR; done

Nonetheless, we've never seen every unit of a given brand crash within that
time - eventually we get good boards - so by sorting boards properly after
testing in this manner you can always end up with a set of good ones (at
least as far as these tests are concerned). So far, no board that has made
it past 2-3 days of this has ever shown a problem with Gaussian98, Gromacs
or distributed.net afterwards (at least until we hit long-term
electromigration problems due to regular CPU heat wear and tear...), but none
of our boards/CPUs (the PCChips M817 LMRs are hitting 16 months of continuous
operation) are there yet.

/kc


> 
> Cheers,
> Justin
> -- 
> 
> System Administrator
> National Institutes of Health
> Center for Information Technology
> 9000 Rockville PK
> Building 12B 2N/207
> Bethesda, MD 20892-5680
> 301.496.0396
> http://biowulf.nih.gov

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


