Athlon SDR/DDR stats for *specific* gaussian98 jobs

Robert G. Brown rgb at phy.duke.edu
Wed May 2 15:16:51 PDT 2001


On Wed, 2 May 2001, Velocet wrote:

> > does ATLAS include prefetching?  it's fairly astonishing how big a
> > difference prefetching (and movntq) can make on duron/athlon code.
> > for an extreme case (Arjan van de Ven's optimized page-copy and -zero):
>
> I'm not too up on the internals of ATLAS. Others on the list probably
> are.

IIRC, somebody on the list (Josip Loncaric?) inserted prefetching into
at least parts of ATLAS for use with Athlons back when they were first
released.  It apparently made quite a significant difference in
performance.

I'm sure Google would turn up the discussion and the patches (if he or
whoever isn't listening).
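For the curious, the flavor of the trick looks something like the
sketch below -- a plain copy loop that issues a software prefetch a few
cache lines ahead of the data it is about to touch.  This is only an
illustration of the general technique being discussed, not the ATLAS
patch or Arjan's page-copy code; it assumes gcc inline asm and the
3DNow! prefetch instruction (Athlon/Duron only), and the prefetch
distance is an untuned guess:

  /* Sketch only: copy loop with 3DNow! software prefetch.  One
   * prefetch is issued per 64-byte cache line, about 8 lines ahead
   * of the load that will actually use the data. */
  #include <stddef.h>

  void copy_with_prefetch(double *dst, const double *src, size_t n)
  {
      size_t i;
      for (i = 0; i < n; i++) {
          if ((i & 7) == 0 && i + 64 < n)
              __asm__ __volatile__("prefetch (%0)" : : "r" (&src[i + 64]));
          dst[i] = src[i];
      }
  }

The win comes from overlapping the memory latency of the next line with
work on the current one; movntq-style non-temporal stores help further
by not polluting the cache with data you'll never read again.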

> My mention that "perhaps buying an assload of Durons instead of dual
> Athlon DDR boards would give more bang for the buck overall" caused
> some consternation. Most objections were about increased switch costs
> for having so many more nodes, or possibly the cost of having a boot
> hard drive for each box (which would increase the costs quite heavily).

No need for a jihad on this -- I don't think anybody would really expect
a dual to truly deliver the same performance (per CPU-memory channel) as
a single, even with nominally doubled memory speed.  So you're probably
right.  There are, I'm sure, plenty of people in the marginal area where
it makes sense to go dual, just as there are plenty for whom it makes
sense to go single.  It may not be very easy to tell which >>one<< is
truly cost optimal, though, without benchmarking your particular code
and doing a careful cost comparison including hidden costs (e.g.
electricity and space costs may be 60% higher for lots of singles, and
each single requires a case, its own memory, a copy of the OS, a network
card, and so forth).  In many cases a dual is only 0.7-0.8 the cost of
two singles, although the high cost of DDR makes that unlikely in this
case.  Still, it is under $1/MB, which isn't all THAT bad -- PC133 cost
that much only months ago.  Months from now DDR may cost little more
than SDRAM in equivalent amounts.

The limited benchmarking I've done of DDR-based systems suggests that
DDR can very easily be worth it for folks doing lots of streaming vector
operations that are memory bandwidth bound.  No surprise there.  For
folks whose code has a lousy, rotten, random, irregular stride or memory
access pattern, OR for folks who run e.g. ATLAS-style optimized code
(which rearranges the problem so that the algorithm runs out of cache
when possible and then fills the cache in single bursts -- there are
some nice white papers on the ATLAS site at netlib.org if you want to
see how it works), the advantages are smaller.  If cache works, it hides
memory access speeds.  I actually don't know whether a random access
pattern is better or worse on DDR (yet) -- it has higher bandwidth but I
would guess the latency is no better, or even a bit worse.
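To make the ATLAS point concrete, the blocking idea in its crudest form
looks like the sketch below.  This is the generic technique, not ATLAS
code -- ATLAS empirically tunes block sizes, loop orders, and register
blocking per machine, and the block size of 32 here is just a guess:

  /* Sketch only: cache-blocked C += A*B for n x n row-major matrices.
   * Each BxB tile is reused while it is still resident in cache, so
   * most accesses become cache hits instead of trips to main memory. */
  #define B 32

  void blocked_dgemm(int n, const double *a, const double *b, double *c)
  {
      int i0, j0, k0, i, j, k;
      for (i0 = 0; i0 < n; i0 += B)
          for (k0 = 0; k0 < n; k0 += B)
              for (j0 = 0; j0 < n; j0 += B)
                  for (i = i0; i < i0 + B && i < n; i++)
                      for (k = k0; k < k0 + B && k < n; k++) {
                          double aik = a[i*n + k];
                          for (j = j0; j < j0 + B && j < n; j++)
                              c[i*n + j] += aik * b[k*n + j];
                      }
  }

Run naively, the same triple loop is bandwidth bound; blocked, it is
mostly arithmetic bound, which is exactly why faster memory buys it
less.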

I will be getting a DDR-equipped 1.33 GHz Tbird in about ten days, at
which time I'll crunch through a bunch of benchmarks and post the
results.  I got it because I have an application that does SOME stuff
that is likely to be CPU/memory bound, while other tasks are easily
parallelized and less memory bandwidth sensitive (so my nodes are
"regular" PC133).  I think this will be a decent architecture for this
particular task -- showing that even mixed memory architectures can be
cost optimal.

> My advantages that probably apply to few others:
>
> - I don't need any high-speed parallelism for this cluster. All jobs
>   run singly by themselves on one node.
>
> - quality isn't even a massive factor - if a board crashes, the job is
>   rescheduled.  Obviously there's an acceptable threshold, and we're
>   way beyond that - two boards running for 12 days didn't crash on
>   either OS. I am now putting together a mini prototype to start before
>   the shipment of the rest of the parts comes and I finalize the
>   cabinet design: I'm gonna run 10 boards within a few inches of each
>   other in a stack (with proper cooling) and see how they fare - I'm
>   mainly curious about RF interference and cooling problems. I am
>   pretty sure they'll be fine (a friend ran 4 of these boards even
>   closer than I plan for 4 weeks with no problems).
>
> - we are running diskless nodes over NFS and we don't read or write to
>   it often (256 kbps/node average during calculations).
>
> So I don't need a big switch, I don't need super reliable BrandName
> equipment with a service contract, and I don't even need high
> performance network cards.

Sounds like you are dead right on all of this.  Embarrassingly parallel
jobs, few to no communications, purely CPU bound -- Durons (or whatever
currently delivers the most raw flops for the least money) are likely to
be perfect for you.  And for many others, actually.  For a long time I
liked Celerons (or even dual Celerons) for the same reasons, although at
this point I've converted to AMD-based systems as their cost-benefit has
overwhelmed Intel's whole product line for my code.

> I'm actually somewhat interested in the power usage stats for different
> Athlon systems. Having fewer but faster nodes may well not save all
> that much power. Someone have wattage stats handy?

Not yet, but maybe soon.  The fast Tbirds do require a big "certified"
power supply, but I'm guessing they draw a lot less than they "require"
except maybe in bursts.  I'm betting they draw around 100-150W running,
a number that recently got some support on the list.

>
> > bear a price-premium that decreases their speed/cost merit, and
> > that freebsd's page coloring sometimes has a measurable benefit.
>
> Wonder how many people are using FreeBSD on their clusters instead of
> Linux...

Dunno.  Based on what I've seen, heard, and discussed with people, it is
a small fraction of the total, but not a small number.  The core effort
has been Linux-centric from the beginning, although of course a lot of
cluster stuff runs fine under BSD (or any *nix).

> > I'm dubious about further interpretation, though.  for instance,
> > you seem to show a significant benefit to tbird's larger cache
> > (384 vs 192K), but surely you chose this workload to be bandwidth
> > intensive, didn't you?  if not, then the DDR comparison is rather
> > specious...
>
> I didn't. I chose it to be related to what we need the cluster for. I'm
> trying to justify my design because it may come into question, which is
> partly why I really need to check out a P4's stats, but I'm pretty sure
> the price/performance is going to be lower than we can afford. The only
> question I'm really trying to head off is why we didn't use the fastest
> Tbirds available and DDR RAM.
>
> I am not going to read the G98 code, it's horrid spaghetti ;) and my
> Fortran isn't that great. And there's A LOT of code. It's not worth my
> time. It's much faster to just run the jobs on different boards and
> see what the results are than to predict them by reading the code.
>
> Actually I have about 3x as many numbers for non-ATLAS jobs, but
> they're kind of useless. However, they do indicate the speedup provided
> by the Thunderbirds, as I managed to get both a Tbird and a Duron at
> 750, 800 and 850. I can dig up those stats if people care, but then
> again these are stats for *my* particular jobs running on non-optimal
> (non-ATLAS) Gaussian.
>
> > thanks for posting the numbers!
>
> No problem, sorry it wasn't more professionally done ;) I also
> apologize for not running standard G98 tests (I'm not aware of what
> would constitute such, or if there's a preset package of benchmarks
> available).

Not a preset package, but I'm trying to start a collection of sorts:

  http://www.phy.duke.edu/brahma/dual_athlon/tests.html

My primary recommendation would be to use lmbench.  It gives you a very
nice set of all sorts of microbenchmarks to profile overall system
performance.  Its packaging could be improved.  Stream-like benchmarks
(including cpu-rate) will give you some idea of float performance
relative to memory access speed.  Stream-2 should let you make a profile
that relates float speed to the size of the memory segment being worked
through (although I haven't tried it yet); cpu-rate definitely does.  A
mixed float/int benchmark that is CPU bound and has no particularly nice
stride or memory access pattern can help you assess overall performance
when the code isn't so nice -- this is what I use my MC code for,
although it isn't really packaged for production.
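For anyone who hasn't looked inside such benchmarks, the heart of a
stream-like bandwidth test is nothing fancier than the sketch below (a
minimal illustration of the standard "triad" kernel, not the actual
stream source -- the array size and crude gettimeofday() timing are
placeholders):

  /* Sketch only: a STREAM-style "triad" bandwidth estimate.  The three
   * arrays total ~48 MB, far bigger than any cache on these CPUs, so
   * the loop is dominated by main-memory traffic.  MB/s is bytes moved
   * (3 arrays x N doubles) divided by wall-clock time. */
  #include <stdio.h>
  #include <sys/time.h>

  #define N 2000000

  static double a[N], b[N], c[N];

  int main(void)
  {
      struct timeval t0, t1;
      double scalar = 3.0, secs;
      long i;

      for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      gettimeofday(&t0, NULL);
      for (i = 0; i < N; i++)
          a[i] = b[i] + scalar * c[i];          /* the triad */
      gettimeofday(&t1, NULL);

      secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
      /* print a[] too so the compiler can't discard the loop */
      printf("triad: %.1f MB/s (check %.1f)\n",
             3.0 * N * sizeof(double) / 1e6 / secs, a[N / 2]);
      return 0;
  }

The real benchmarks run several kernels, repeat them, and verify the
results; the point of showing this is just that such a loop does almost
no arithmetic per byte moved, so it measures the memory system rather
than the FPU.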

Hope this all helps or is interesting.  I'm very interested in Athlon
performance profiles as they seem to be the current
most-CPU-for-the-least-money winners, and when one buys in bulk (as
beowulf humans tend to do) this sort of optimization really matters.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






