Motherboard / Benchmark Questions...

Dean Waldow waldow at rainier.chem.plu.edu
Thu Jun 15 00:08:54 PDT 2000


> > the PIII's.  I am not confident in that estimate but it is interesting
> > and would likely be heavily code specific.
> 
> A lot of this depends on how cache-local your code is.  From the numbers
> you post (presuming you've adjusted for clock speed differences, since
> you were comparing Celerons and PIII's at different clocks), it sounds
> like the application is very NONlocal -- the larger L2 cache on the PIII
> and its faster memory seem to make a significant difference.  If your
> application were a bit more local, you would likely see much more nearly
> equivalent performance between these two.  The Athlon has a different
> (and presumably faster) cache, so it might well outperform the PIII on
> moderately nonlocal code.

Thanks for the comments.  The raw numbers are in the table below.  In
our calculation of runs/cluster-day we did scale the clock speed to that
of the processor in the hypothetical cluster, so those numbers had both
raw speed and economic factors in them...  Basically, what we could buy
for our budget.  Here are the raw numbers without the current price data
in the calculation.

For my Monte Carlo code (no overclocking):

system                                      one run   norm. to 1GHz
                                              (min)       (min)
-------------------------------------------------------------------------
Celeron 400MHz (128K cache, 128MB RAM)          153        61
PIII 550MHz Cu-mine (256K cache, 128MB RAM)      76        42
  Asus P3B-F board with PC100 memory
PIII 600MHz Katmai, dual (512K cache, 504MB RAM)
  Asus P2B-D board
    1 process running only                       75        45
    2 concurrent processes on dual               79        47
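
The normalized column can be reproduced with a few lines of Python.  This
is only a sketch of the arithmetic, assuming run time scales inversely
with clock speed (which is exactly the assumption that breaks down for
memory-bound code); the function name and labels are made up for
illustration.

```python
# Raw timings copied from the table: (label, clock in MHz, minutes per run).
runs = [
    ("Celeron 400", 400, 153),
    ("PIII 550 Cu-mine", 550, 76),
    ("PIII 600 Katmai, 1 process", 600, 75),
    ("PIII 600 Katmai, 2 processes", 600, 79),
]


def normalize_to_1ghz(minutes, clock_mhz):
    """Scale a run time to a hypothetical 1 GHz clock, assuming time ~ 1/clock."""
    return minutes * clock_mhz / 1000.0


normalized = {label: normalize_to_1ghz(m, mhz) for label, mhz, m in runs}

for label, value in normalized.items():
    print(f"{label:30s} {value:5.1f} min at 1 GHz")
```

If clock speed were the only factor, all four normalized times would come
out about the same; the gap between the Celeron and the PIII's is what
points at cache and memory.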

I think it is still the same basic conclusion if I am understanding your
argument...  pretty significant difference between celeron and PIII but
smaller differences between PIII's / dual.
 
> > 1)  Since my tests indicate little difference in throughput for single
> > cpu vs. dual cpu nodes, are there other advantages one way or the other
> > in using dual vs. a single cpu nodes?
> 
> This, too, depends on how memory intensive the applications are.  The
> major "weakness" of a dual is that two processors running flat out on
> memory access can saturate the memory bus of Intel systems.  If the
> program does enough computation per memory access, the memory accesses
> will antibunch and your applications will still complete (nearly) twice
> as fast on a dual system.  My embarrassingly parallel Monte Carlo code
> works like this -- I get nearly perfect scaling on duals as well as
> across the cluster.  However, on memory-intensive code performance can
> drop off so that it takes (for example) 1.3-1.5x as long to complete a
> job on a dual running two jobs.  You still generally get gain relative
> to one processor running two jobs, but two separate nodes will be
> faster (completing 2 jobs in 1x the single CPU time).

I think the dual performance is similar to your experience with duals,
given the numbers above, though maybe not quite perfect scaling: two
concurrent runs take ~5% longer than one (79 vs. 75 min), but I don't
think that is too bad.  It just seems that when going through the
complete calculation for the number of nodes the budget allows, the
throughput doesn't come out much different.
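
To make the single-versus-dual comparison concrete, here is a minimal
Python sketch using only the two PIII 600MHz timings from the table (75
min for one process, 79 min each for two concurrent processes).  Prices
are deliberately left out, so this is per-node throughput, not the budget
calculation itself; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(minutes_per_run, concurrent_jobs=1):
    """Completed runs per day for one node running `concurrent_jobs` at once."""
    return concurrent_jobs * MINUTES_PER_DAY / minutes_per_run


single = runs_per_day(75)                   # one process on the dual board
dual = runs_per_day(79, concurrent_jobs=2)  # both CPUs busy

# Fraction of ideal 2x scaling actually achieved by the dual node.
efficiency = dual / (2 * single)

print(f"single: {single:.1f} runs/day, dual: {dual:.1f} runs/day")
print(f"dual scaling efficiency: {efficiency:.1%}")
```

Whether a dual node or two single nodes wins on runs/cluster-day then
depends on what each costs, which is the part not shown here.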

> > 133FSB and PC133 memory. Does this summary make sense? And are there
> > folks successfully using the newer chipsets? :)
> 
> I have no comment on stability.  As far as performance goes, since your
> application >>seems<< to be fairly memory intensive based on the
> celeron-PIII differentiation, the faster memory might well make a
> difference.  The only way to know for sure is to test it (or understand
> the memory access pattern of your code in detail).  Is your Monte Carlo
> algorithm doing a random site update (and hence jumping all over
> memory)?  Is there any way to organize it to operate more locally?

I think you are right that it seems memory dependent: 128MB appears to
be enough capacity, but the run time is likely influenced by memory
speed at the least.  The algorithm does pick a random spot in my 3D
lattice and consequently I would say it does likely jump all over
memory.  As to organizing the code to operate more locally, I don't know
a simple way.  I would have to really study the implications for the
results to feel confident about that relative to the simulation time
savings.
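
A small Python sketch of why random site selection is hard on the cache.
It is not the actual simulation code: the lattice edge length, the
row-major layout, and the helper names are all assumptions for
illustration.  It compares how far apart consecutive accesses land in a
flattened 3D array for random picks versus a sequential sweep.

```python
import random

EDGE = 64  # hypothetical lattice edge length; the real size is not given


def flat_index(x, y, z, edge=EDGE):
    """Row-major offset of site (x, y, z) in a flattened edge**3 array."""
    return (x * edge + y) * edge + z


def mean_jump(indices):
    """Average distance, in array elements, between consecutive accesses."""
    jumps = [abs(b - a) for a, b in zip(indices, indices[1:])]
    return sum(jumps) / len(jumps)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_updates = 10_000

random_sites = [
    flat_index(rng.randrange(EDGE), rng.randrange(EDGE), rng.randrange(EDGE))
    for _ in range(n_updates)
]
sequential_sites = list(range(n_updates))  # a plain sweep through memory

random_jump = mean_jump(random_sites)
sequential_jump = mean_jump(sequential_sites)

print(f"mean jump, random picks:     {random_jump:.0f} elements")
print(f"mean jump, sequential sweep: {sequential_jump:.0f} elements")
```

The sweep moves one element at a time, while random picks land a large
fraction of the lattice away on average, so once the lattice is bigger
than the L2 cache most updates have to go out to main memory.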

> > the long run.
> 
> The only safe way to compare is to test it.  My own tests of Athlons
> with my Monte Carlo code were very disappointing -- I get by far the
> best price performance on Celerons, as my code is generally local enough
> to run satisfactorily with a 128 K L2 cache (even allowing for slower
> memory).  The benchmarks I've run suggest that the Athlon's real
> strength is its cache and memory subsystem.  However, your mileage may
> vary considerably.

I hope to have an Athlon test in the near future and it will be
interesting to see where it falls.

> You can "generalize" (perhaps) only after you understand your code and
> the things that are determining its effective speed.  As a rule, a CPU
> bound process is primarily affected by clock more than anything else.
> As a process becomes memory bound, speeds are very nonlinearly affected
> by stride and memory access pattern and so forth.  This can all be
> understood and guesstimated, but it is difficult to predict what the
> answers will be for your application without the source code or a
> description of the algorithm.

The (non)linearity with clock speed is much more understandable now.  I
also have some benchmarks on a 733MHz PIII but am not confident in them
yet, since I don't know much about the system they were run on and they
seemed to be almost the same as the 550MHz/600MHz PIII tests I did.  If
I become confident in that number, it would be consistent with the
memory intensive nature of the code.  The interesting question then
seems to be connected to memory bus speed: processor speed can keep
going up, but if the memory bus is the limiting factor then you might
not see much difference.  When I get a benchmark on a system with a
faster bus, it might be pretty informative also.
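
The flattening with clock speed can be illustrated with a toy model in
Python: run time is split into a compute part that scales as 1/clock and
a memory-stall part that does not.  The split used below (10 min of
compute at 1 GHz, 55 min of memory stalls) is invented purely for
illustration and is not fitted to any of the measurements above.

```python
def run_time(clock_mhz, compute_min_at_1ghz, memory_min):
    """Toy model: compute time scales with 1/clock, memory stall time is fixed."""
    return compute_min_at_1ghz * 1000.0 / clock_mhz + memory_min


# Invented, illustrative parameters (NOT measured values).
COMPUTE_MIN_AT_1GHZ = 10.0
MEMORY_MIN = 55.0

clocks = [550, 600, 733]
memory_bound = {c: run_time(c, COMPUTE_MIN_AT_1GHZ, MEMORY_MIN) for c in clocks}
cpu_bound = {c: run_time(c, COMPUTE_MIN_AT_1GHZ, 0.0) for c in clocks}

for c in clocks:
    print(f"{c} MHz: memory-bound {memory_bound[c]:5.1f} min, "
          f"CPU-bound {cpu_bound[c]:5.1f} min")

# Speedup going from 550 MHz to 733 MHz under each assumption.
speedup_memory_bound = memory_bound[550] / memory_bound[733]
speedup_cpu_bound = cpu_bound[550] / cpu_bound[733]
```

With no memory term the speedup equals the clock ratio (733/550); with a
large fixed memory term it stays close to 1, which is the pattern a
nearly unchanged 733MHz timing would suggest.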

>   Hope this helps,

It certainly does! And hope it can help others also. Thanks...

> Robert G. Brown                        http://www.phy.duke.edu/~rgb/

Dean W.
-- 
-----------------------------------------------------------------------------
Dean Waldow, Associate Professor      (253) 535-7533 
Department of Chemistry               (253) 536-5055 (FAX)
Pacific Lutheran University           waldowda at plu.edu
Tacoma, WA  98447   USA               http://www.chem.plu.edu/waldow.html
-----------------------------------------------------------------------------
---> CIRRUS and the Chemistry homepage: http://www.chem.plu.edu/         <---
-----------------------------------------------------------------------------



