Motherboard / Benchmark Questions...

Robert G. Brown rgb at
Wed Jun 14 13:29:51 PDT 2000

On Wed, 14 Jun 2000, Dean Waldow wrote:

> We then calculated the number of simulations we could run on the
> hypothetical cluster in a day: Celeron systems were the slowest, with
> PIII and PIII duals basically equivalent at about a 25% increase in
> throughput, and lastly (using an estimate for Athlons) the Athlons
> hypothetically came out the highest, with an additional ~15% increase in
> throughput over the PIIIs.  I am not confident in that estimate but it
> is interesting and would likely be heavily code specific.

A lot of this depends on how cache-local your code is.  From the numbers
you post (presuming you've adjusted for clock speed differences, since
you were comparing Celerons and PIIIs at different clocks), it sounds
like the application is very NONlocal -- the larger L2 cache on the PIII
and its faster memory seem to make a significant difference.  If your
application were a bit more local, you would likely see much closer
performance between these two.  The Athlon has a different (and
presumably faster) cache, so it might well outperform the PIII on
moderately nonlocal code.
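
One rough way to get a feel for locality is to sum the same array twice,
once in memory order and once in shuffled order, and compare the times.
The sketch below is only an illustration under stated assumptions: the
function names and the array size are made up for the example, and the
Python interpreter overhead hides much of the cache effect that the same
loop would show in C or Fortran.

```python
import random
import time
from array import array


def timed_sum(data, order):
    """Sum data[i] for each i in order; return (total, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start


def compare_access(n=1 << 20, seed=0):
    """Sum the same array in sequential and in shuffled index order."""
    data = array("q", range(n))        # n contiguous 8-byte integers
    sequential = array("q", range(n))  # visit order 0, 1, 2, ...
    shuffled = list(range(n))
    random.Random(seed).shuffle(shuffled)
    shuffled = array("q", shuffled)    # same indices, random order
    seq_total, seq_time = timed_sum(data, sequential)
    rnd_total, rnd_time = timed_sum(data, shuffled)
    return seq_total, rnd_total, seq_time, rnd_time


if __name__ == "__main__":
    s, r, ts, tr = compare_access()
    print(f"sequential {ts:.3f} s, shuffled {tr:.3f} s, ratio {tr / ts:.2f}")
```

Both passes touch exactly the same elements, so the totals must agree;
only the order of access differs.  A ratio well above 1 on a working set
larger than the cache is a hint that access order matters on that
machine.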

> On one level, the differences in throughput are not terribly significant
> compared to the increase I will be able to get on a cluster vs. the
> current machines I have.  Thus, I am left with a few questions; if
> anyone has comments on them, that would be great.  If these questions
> do not have much general interest, I could summarize off-list
> comments later.
> 1)  Since my tests indicate little difference in throughput for single
> cpu vs. dual cpu nodes, are there other advantages one way or the other
> in using dual vs. single cpu nodes?

This, too, depends on how memory intensive the applications are.  The
major "weakness" of a dual is that two processors running flat out on
memory access can saturate the memory bus of Intel systems.  If the
program does enough computation per memory access, the memory accesses
will antibunch (interleave rather than collide) and your applications
will still complete (nearly) twice as fast on a dual system.  My
embarrassingly parallel Monte Carlo code works like this -- I get nearly
perfect scaling on duals as well as across the cluster.  However, on
memory-intensive code performance can drop off so that each job takes
(for example) 1.3-1.5x as long to complete when the dual is running two
jobs at once.  You still generally gain relative to one processor
running the two jobs one after the other (2x the single-job time), but
two separate nodes will be faster (completing 2 jobs in 1x the single
CPU time).
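
The arithmetic behind this comparison can be sketched in a few lines.
The function names and the one-hour job length are hypothetical, chosen
only for illustration; the slowdown factor is the 1.3-1.5x stretch
mentioned above, taken as 1.4 here.

```python
def jobs_per_day(single_job_hours, cpus, slowdown=1.0):
    """Jobs per day from `cpus` processors, each running one job at a time.

    slowdown is the factor by which a job stretches when the CPUs
    contend for the memory bus (1.0 means no contention).
    """
    return cpus * 24.0 / (single_job_hours * slowdown)


def compare(single_job_hours=1.0, slowdown=1.4):
    """Return throughput for one CPU, one dual node, and two single nodes."""
    one_cpu = jobs_per_day(single_job_hours, 1)
    dual = jobs_per_day(single_job_hours, 2, slowdown)
    two_nodes = jobs_per_day(single_job_hours, 2)
    return one_cpu, dual, two_nodes


if __name__ == "__main__":
    one_cpu, dual, two_nodes = compare()
    print(f"one CPU:          {one_cpu:.1f} jobs/day")
    print(f"dual (contended): {dual:.1f} jobs/day")
    print(f"two single nodes: {two_nodes:.1f} jobs/day")
```

In this simple model a dual delivers 2/slowdown times the throughput of
one CPU, so a stretch of 1.3-1.5x corresponds to roughly 1.5x down to
1.3x, against a full 2x for two separate nodes.  Whether the dual is
still the better buy then depends on what the second node costs.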

> 2) In the case of the PIII processor, the question seems to be one of
> mainboard choice, which in turn is mostly about chipsets - right?  From
> what I have read...  The two chipsets that seem prevalent are the 440BX
> and the VIA Apollo 133A.  The newer Intel chipsets (i8xx) make me a
> little cautious from what I have read, though that may be mostly due to
> the i820.  The 440BX boards sound like well tested and stable performers
> but on the "older side", without much difference in price.  The VIA
> Apollo 133A seems like it would have advantages if code benefits from
> the 133 MHz FSB and PC133 memory. Does this summary make sense? And are
> there folks successfully using the newer chipsets? :)

I have no comment on stability.  As far as performance goes, since your
application >>seems<< to be fairly memory intensive based on the
Celeron-PIII differentiation, the faster memory might well make a
difference.  The only way to know for sure is to test it (or understand
the memory access pattern of your code in detail).  Is your Monte Carlo
algorithm doing a random site update (and hence jumping all over
memory)?  Is there any way to organize it to operate more locally?
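
To make the difference between the two update orders concrete, here is a
small sketch that compares a random site update with a sequential sweep
by the average distance between consecutively updated sites.  It assumes
a hypothetical one-dimensional array of sites; the function names are
invented for the example, and nothing here is taken from your code.

```python
import random


def random_order(n_sites, n_updates, rng):
    """Pick each site to update uniformly at random."""
    return [rng.randrange(n_sites) for _ in range(n_updates)]


def sweep_order(n_sites, n_updates):
    """Visit sites in memory order, wrapping around at the end."""
    return [i % n_sites for i in range(n_updates)]


def mean_stride(order):
    """Average distance, in sites, between consecutive updates."""
    jumps = [abs(b - a) for a, b in zip(order, order[1:])]
    return sum(jumps) / len(jumps)


if __name__ == "__main__":
    n = 100_000
    rng = random.Random(0)
    print("random site update:", mean_stride(random_order(n, n, rng)))
    print("sequential sweep:  ", mean_stride(sweep_order(n, n)))
```

A sweep moves one site at a time, so neighbouring updates share cache
lines, while random selection jumps on the order of a third of the array
per update.  Whether a sweep is statistically acceptable for a given
model is a separate question that this sketch does not address.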

> 3) Since I have not been able to benchmark my code on an Athlon, does
> anyone have experience in comparing performance on Athlons versus
> PIIIs for a real world example?  Or, are the performance differences
> really so code dependent that it is difficult to "generalize."  8-)  The
> potential for increased throughput is tempting but without better
> estimates the stability/certainty of the PIII may be more important in
> the long run.

The only safe way to compare is to test it.  My own tests of Athlons
with my Monte Carlo code were very disappointing -- I get by far the
best price/performance on Celerons, as my code is generally local enough
to run satisfactorily with a 128 KB L2 cache (even allowing for slower
memory).  The benchmarks I've run suggest that the Athlon's real
strength is its cache and memory subsystem.  However, your mileage may
vary considerably.

You can "generalize" (perhaps) only after you understand your code and
the things that are determining its effective speed.  As a rule, a CPU
bound process is affected by clock more than anything else.  As a
process becomes memory bound, its speed depends very nonlinearly on
stride, memory access pattern and so forth.  This can all be
understood and guesstimated, but it is difficult to predict what the
answers will be for your application without the source code or a
description of the algorithm.
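
As an example of the kind of guesstimate meant here, the sketch below
uses the textbook "compute time plus miss rate times miss penalty"
estimate of time per operation.  It is a deliberately crude model: it
assumes the miss penalty is fixed in nanoseconds regardless of CPU
clock, that computation and memory stalls do not overlap, and that
stride and access pattern are folded into a single miss rate.  All of
the numbers are hypothetical.

```python
def ns_per_op(clock_mhz, cycles_per_op, miss_rate, miss_penalty_ns):
    """Estimated time per operation: compute time plus average memory stall."""
    compute_ns = cycles_per_op * 1000.0 / clock_mhz  # one cycle = 1000/MHz ns
    memory_ns = miss_rate * miss_penalty_ns
    return compute_ns + memory_ns


def speedup_from_clock(old_mhz, new_mhz, cycles_per_op, miss_rate,
                       miss_penalty_ns):
    """How much faster the same code runs when only the clock changes."""
    old = ns_per_op(old_mhz, cycles_per_op, miss_rate, miss_penalty_ns)
    new = ns_per_op(new_mhz, cycles_per_op, miss_rate, miss_penalty_ns)
    return old / new


if __name__ == "__main__":
    # Hypothetical figures: 1 cycle per operation, 100 ns per cache miss.
    for miss_rate in (0.0, 0.01, 0.1):
        gain = speedup_from_clock(500, 1000, 1, miss_rate, 100.0)
        print(f"miss rate {miss_rate:4.2f}: 2x clock gives {gain:.2f}x speed")
```

With no misses, doubling the clock doubles the speed; as the miss rate
rises the memory term dominates and the extra clock buys very little.
The real miss rate is exactly the part that depends on your algorithm.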

  Hope this helps,


> Thanks for any input and I hope these questions are not too simple...
> Dean W.
> -- 
> -----------------------------------------------------------------------------
> Dean Waldow, Associate Professor      (253) 535-7533 
> Department of Chemistry               (253) 536-5055 (FAX)
> Pacific Lutheran University           waldowda at
> Tacoma, WA  98447   USA     
> -----------------------------------------------------------------------------
> ---> CIRRUS and the Chemistry homepage:         <---
> -----------------------------------------------------------------------------
> _______________________________________________
> Beowulf mailing list
> Beowulf at

Robert G. Brown	             
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at
