Athlon vs Pentium III

Robert G. Brown rgb at phy.duke.edu
Tue Feb 20 07:59:08 PST 2001


On Fri, 16 Feb 2001, Eric Hoyt wrote:

> Hi,
>
> I'm in the initial phase of putting together a 24 node cluster.  The
> specifics of what it will be used for aren't fully developed (ie what
> applications will be run), but its main purpose will be optical
> simulations.  One issue I am faced with is the choice in CPU.  I know the
> response to "What hardware should I choose?" is always application
> specific, so I'm just looking for general pointers and any issues people
> may have encountered that they would be willing to share.
>
> Anyway, here is what it has been narrowed down to: I can put together a
> cluster of 24 Thunderbird 850s, or 12 dual PIIIs at 1GHz (24 total
> CPUs) for around the same amount.  Both setups have 256MB of memory per
> CPU.  What do you consider when making a decision like this? What
> scenarios favor Athlons, and what scenarios favor PIIIs?  Also, I have
> heard advice to go with classic Athlons instead of Thunderbirds, due to
> 512k off die vs 256k on die cache issues.  Again, what would make you lean
> one way or the other?
>
> Again, I know that the ultimate decision will depend on what we'll be
> running, but any general tips or advice would be greatly appreciated.

Since you already know that the best choice will be one specific to your
application(s), you are well on the way to figuring out the best choice.
I concur with your price comparison.  A Tbird 1 GHz node with 256 MB
costs just about exactly $500 diskless, $600 with 10 GB disk over the
counter (case, mobo, cpu, memory, NIC, floppy, build it yourself).  In a
volume purchase you could probably get enough knocked off to pay for a
cheap video card and heavy duty shelving to rack them all up, for a
total 24-node cost of $12K to $15K.  As you have noted, with careful
shopping you can get a dual PIII node for $1000-1200 (build it yourself
price) and rack up 24 processors that way.  The GHz PIII is almost
twice as expensive as the GHz Tbird, but in a dual you buy only one
motherboard, one case, one NIC, and one disk per pair of CPUs.

The first thing to do is (if possible) get one of each to play with --
beg, borrow, buy, steal if necessary.  The winner will become a node.
The loser will cover a desktop or become a server for the cluster.  I
found some very nifty motherboards with built-in RAID controllers, for
example, that would probably make nice servers.

This SHOULD let you prototype your application(s) on the two candidates
and get the absolute best answer to your question without wasting any
money.  If this isn't possible (and frequently it isn't -- you may be
doing all this work to write a proposal for the money to buy the cluster
and not have any money at all for prototyping hardware or any friends
nearby that you can bum a run off of) then you have to work on the
second best answer.

I'm going to assume that you have SOME sort of ix86 instruction set
system to work on -- an Athlon or P6-family box.  What you have to
determine is whether your application is:

  memory bound
  CPU bound
  disk I/O bound
  network I/O bound (in its parallel version)
  cache-size sensitive (tied into memory/cpu bottlenecks, usually)

and possibly even the relative weights of integer instructions, floating
point instructions, transcendental/math library calls, and system calls
in the code mix.  Note that these are not exclusive -- a program can be
both memory and CPU bound, for example, either in completely different
(but important) parts of the code or in different phases of a single
part.  There are also strong nonlinear interactions between all of the
latencies and bandwidths associated with these bottlenecks -- shifting
e.g. the size of the application can easily move you from one regime to
another across all sorts of interesting ground in between.

A lot of this can be roughly determined (at least in relative terms) by
compiling with the -pg (profile) flag and applying gprof.  Read e.g. man
gprof to get an idea of what it is and what it does.  Remember that a
Heisenberg Uncertainty principle is at work here, though -- the
absolute times returned by a profiler won't be precisely right because
of the extra overhead introduced by the use of the profiler itself.  It
will usually give you the RELATIVE times fairly reasonably, though --
especially if (as is usually the case) your program spends most of its
time in one or two core loops doing just a handful of very repetitive
things.
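
For example, assuming gcc (the flags and the program name "myapp" here
are just stand-ins for your own), the whole cycle looks something like:

    gcc -O2 -pg -o myapp myapp.c -lm
    ./myapp                        # a normal run; this writes gmon.out
    gprof ./myapp gmon.out | less

The flat profile at the top of gprof's output gives the percentage of
run time spent in each function, which is usually enough to tell you
where your code really lives.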

A lot of it can also be determined by just looking over your code and
seeing how its execution time scales when you increase the "size" of the
run (if it has one or more "size" parameters -- most code does but not
all).  It is a good idea to try to profile or analyze your code mix for
a variety of size (or other control parameter) values to get some idea
of how its rate-determining operations vary at those different scales.
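
If your code happens to take its size on the command line (purely
hypothetical -- substitute whatever control parameter your own code
actually has), a crude but effective version of this is just

    for size in 64 128 256 512 1024; do
        /usr/bin/time ./myapp $size
    done

and watching whether the run time grows linearly, quadratically, or
jumps suddenly when the working set no longer fits in cache.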

Once you understand your code pretty well, THEN you can proceed to make
an intelligent choice of system.  Here are some things that will bear
on the decision:

  a) Dual CPU systems oversubscribe the memory bus.  Consequently, if
your task is large and very memory intensive (in fact, memory bus bound)
it will NOT run twice as fast on a dual -- 1.2-1.7x is a more likely
range, depending on the details of just how memory bus bound your
application is and how expensive/fancy the memory subsystem is on the
motherboard you select.
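
A crude way to test for this is to time one instance of your job on an
idle dual and then two at once ("myapp" again standing in for your own
code):

    /usr/bin/time ./myapp                            # one copy
    /usr/bin/time ./myapp & /usr/bin/time ./myapp &
    wait                                             # two copies, one per CPU

If each of the pair takes much longer than the solo run did, your job is
memory bus bound and the second CPU is buying you a lot less than 2x.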

  b) Cache "works" for some programs and doesn't for others.  That is,
for the memory access pattern of some programs, (say) 90% of the memory
accesses come out of cache and only 10% come from main memory.  If cache
works for your program, chances are you don't give a rat's ass about
memory bus speed (in that you won't see much if any difference in
overall execution speed with faster memory).  If your prototyping system
has an adjustable memory bus clock and your system still runs with two
or more of the settings, see if altering the setting makes any
difference at all in the completion time of your application.  If not,
then you're probably CPU bound, cache works, and you can ignore the
memory bus speed altogether.  Your program will PROBABLY run just as
fast on a dual as on a single even if two instances are running at the
same time, as one will tend to use the memory bus while the other is
running out of cache.
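
If you can't fiddle the bus clock, a few lines of C along these lines (a
minimal sketch, not production code) will show you directly where your
machine's caches stop helping -- the time per access jumps when the
working set outgrows L1, and again when it outgrows L2:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        unsigned long n, i, p, passes;
        double *a, sum, secs;
        clock_t t0, t1;

        /* working sets from 8 KB to 16 MB of doubles */
        for (n = 1024; n <= 2048 * 1024; n *= 2) {
            a = (double *) malloc(n * sizeof(double));
            if (a == NULL) break;
            for (i = 0; i < n; i++) a[i] = 1.0;
            passes = (16 * 1024 * 1024) / n;  /* constant total work */
            sum = 0.0;
            t0 = clock();
            for (p = 0; p < passes; p++)
                for (i = 0; i < n; i++)
                    sum += a[i];
            t1 = clock();
            secs = (double) (t1 - t0) / CLOCKS_PER_SEC;
            printf("%8lu KB  %7.2f ns/access  (sum=%g)\n",
                   (unsigned long) (n * sizeof(double) / 1024),
                   1e9 * secs / ((double) n * passes), sum);
            free(a);
        }
        return 0;
    }

Compile with optimization (gcc -O2); printing the sum keeps the compiler
from throwing the loop away.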

  c) You need overall to think about sharing of resources on a dual.  A
dual can also easily bottleneck on (e.g.) the disk or the network, if
either are used intensively during your program (parallelized version or
not).  Indeed, if the application is rich in any kind of system calls,
you have to think about both the efficiency and stability of the SMP
kernels relative to the UP kernels.  I've used both for years -- the SMP
kernels are generally pretty good for numerical work but I also cannot
deny that they are less stable for desktop operations, where some of the
drivers you might want for nifty hardware may not be SMP-safe (or come
with an SMP version of the drivers at all).  This can work in favor of a
dual as easily as against it -- a dual has two CPUs with which to
spread out certain shared loads, and favorable execution patterns can
develop that actually speed program execution by avoiding certain
blocks.

  d) Finally, different CPUs execute different kinds of instruction
mixes differently.  Not too deep, but very important to remember.  The
"cpu-rate" microbenchmark I've been working on moderately seriously for
a year or so now shows the Athlon 800 significantly faster than a 933
MHz PIII, both in L1 and L2 cache and running out of main memory.  The
Athlon (apparently) has a better memory subsystem and the benchmark thus
shows better vector-type float performance all the way out into main
memory.  By significantly better I mean just about exactly twice as fast
at equivalent clock for a clean vector of pure arithmetic instructions.

HOWEVER (before you take that result too seriously), they evaluate
transcendentals at just about the same rate (corrected for clock).  Also,
when I run my own personal Monte Carlo application/benchmark on the 933
MHz PIII and the 800 MHz Tbird immediately available to me, it runs at
so very very nearly the same rate (corrected for clock) that there is
absolutely no reason to choose one over the other except cost.  My very
own cpu-rate benchmark is not an effective predictor of performance for
my own favorite application (which is NOT particularly vectorized and is
heavy in transcendental calls and indexing/integer arithmetic).
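
To see which regime your own code lives in, time the two kinds of work
separately.  A minimal sketch (compile with something like gcc -O2 -o
mix mix.c -lm):

    #include <stdio.h>
    #include <math.h>
    #include <time.h>

    #define N 10000000

    int main(void)
    {
        double s = 1.0;
        long i;
        clock_t t0, t1;

        /* dependent multiply-adds: pure pipelined arithmetic */
        t0 = clock();
        for (i = 0; i < N; i++)
            s = s * 1.0000001 + 0.000001;
        t1 = clock();
        printf("multiply-add: %.1f M ops/sec\n",
               N / 1e6 / ((double) (t1 - t0) / CLOCKS_PER_SEC));

        /* libm transcendental calls */
        t0 = clock();
        for (i = 0; i < N; i++)
            s += sin(0.5 + i * 1e-7);
        t1 = clock();
        printf("sin():        %.1f M calls/sec\n",
               N / 1e6 / ((double) (t1 - t0) / CLOCKS_PER_SEC));

        printf("(checksum %g)\n", s);  /* keeps the loops honest */
        return 0;
    }

If the second rate is tiny compared to the first and your code is full
of libm calls, a vector float benchmark (mine included) won't predict
much of anything for you.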

Interestingly, when I run this application two at a time on an
RDRAM-equipped dual 933 PIII (that is, one instance per cpu for both
CPUs), it runs about 5-10% FASTER than it does when running one at a
time on the same system or on my SDRAM-equipped 933 PIII system.

This is all a rather chaotic and meandering way of showing that the
Universe of computer architectures is sufficiently insane that
prototyping is the only rational approach to take if you really MUST get
the right answer.  Hence my suggestion that you at least consider
committing a felony rather than proceeding without it.

Otherwise you can try to analyze the cost/benefit as outlined above
(profiling, researching, browsing SPEC and so forth), with perhaps a
20% chance of picking wrong unless your application is "just like" a
component of SPEC or some other benchmark you happen upon.

Finally, you can consider the following.  As a general rule in
computers, one head is better than two.  All things being equal, you
want to minimize the chance of hitting a bottleneck.  Any resource is a
potential bottleneck.  Sharing the resource between two CPUs simply
guarantees that IF such a single CPU bottleneck ever emerges in your
application mix, you're toast (or have to spend more money to try to
overcome it -- adding additional NIC's, disks and so forth).  If the
bottleneck is down low (memory bus or PCI bus) then even money won't
help -- you cannot avoid sharing the same memory bus or PCI bus on a
dual, and if one CPU can saturate either one, the other has to wait its
turn when they both require the resource at the same time.

This general rule isn't much worse than going through all the analysis
above -- it might raise the chances of choosing poorly to 25% because
there are some circumstances where a dual is actually a bit better
choice -- but it is a lot easier to follow and makes your nodes easier
to eventually recycle onto desktops.  I personally am shifting away from
duals (PIII or Celeron) in favor of the single Athlons because of the
bottleneck issues and because the Tbirds seem to generally perform no
worse than the PIIIs on any tests I can run, and sometimes they perform
spectacularly better.  Hence I think of the PIII as being significantly
overpriced.

BTW, the Dual Athlon is coming very soon now, and may add a third choice
that comes in a clear $100/CPU cheaper than the other solutions if your
code DOES do well on a dual.  If you are indeed writing a proposal with
months to go before you'll actually purchase, you may be able to change
over at that time to something completely different anyway.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






