[Beowulf] MPI application benchmarks

Mon May 7 14:59:18 PDT 2007

On Mon, 7 May 2007, Bill Rankin wrote:

>
> Toon Knapen wrote:
>> Mark Hahn wrote:
>>> sure.  the suggestion is only useful if the cluster is dedicated to a 
>>> single purpose or two.  for anything else, I really think that 
>>> microbenchmarks are the only way to go.
>
> I'm not sure that I agree with this - there are just so many different micro 
> benchmarks that I would worry that relying upon them for anything other than 
> basic system validation (which they are very good at) leaves the potential 
> for some very big holes in your requirements.  Especially in a general 
> purpose system like the one proposed.

I think that there is something to be said for both.  The standard
answer for how to prototype systems for use in a single or few purpose
cluster is "with your applications" to be sure, but general purpose
clusters are a different ball of wax because they inevitably involve
cost-benefit compromises.  Remember, one has to optimize design between
relatively few, very fast, very large memory, very expensive network
nodes (suitable for tightly coupled fine-grained parallel stuff) where
one might well spend MORE on just network and memory than on "the
computer" (everything else) and a much larger stack of utterly
disposable boxes on pretty much any old network with a base 512 MB of
memory (which is more than you need but it is difficult to get any more)
to run very simple EP applications.  [Leaving out clusters with complex
or high speed storage requirements, which basically can double the high
end cost again or close to it.]
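
To put rough numbers on that tradeoff, here is a minimal sketch in
Python.  Everything in it is an assumption made for illustration: the
function names, the budget, and both per-node prices are placeholders,
not quotes from any vendor.  It only shows how quickly the node count
shifts as a share of a fixed budget moves from cheap EP boxes to
expensive tightly coupled nodes.

```python
# Hypothetical budget split between "fat" (fast network, big memory) nodes
# and "thin" (disposable EP) nodes. All prices are made-up placeholders.

def nodes_for_budget(budget, cost_per_node):
    """Whole number of nodes that fit in the given budget."""
    return budget // cost_per_node


def split_cluster(budget, percent_fat, cost_fat, cost_thin):
    """Spend percent_fat percent of the budget on fat nodes, the rest on thin."""
    fat_budget = budget * percent_fat // 100
    thin_budget = budget - fat_budget
    return {
        "fat_nodes": nodes_for_budget(fat_budget, cost_fat),
        "thin_nodes": nodes_for_budget(thin_budget, cost_thin),
    }


if __name__ == "__main__":
    # Placeholder figures: 100000 total, 5000 per fat node, 800 per thin node.
    for percent in (0, 30, 50, 100):
        print(percent, split_cluster(100_000, percent, cost_fat=5_000, cost_thin=800))
```

With these placeholder prices, every fat node costs a bit more than six
thin ones, which is the compromise in a nutshell.  Real prices will
differ, and storage is left out here too.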

When computing the economics of the compromises involved when you don't
even know for sure what KIND of applications will be running, or
are pretty sure they range from (say) 70% EP or very coarse grained to
30% fine grained, with only the most generic idea of the needs of the
fine grained applications (to what extent are they latency bound?  bw
bound?  memory bound?  storage bound?) then it really, really helps to
have a "rich" set of microbenchmarks -- at least at the level of lmbench
if not beyond.

Otherwise you might know how fast it (comparatively) runs Joe Down the
Hall's application, but without a lot of study you won't have any idea
what that means.  The same is to some extent true even of suites
of macro benchmarks like SPEC -- unless you really take the time to work
through the code and see what it looks like, even a "typical monte
carlo" benchmark component might not scale at all like YOUR monte carlo
benchmark because theirs might be in 2D with binary (Ising) spins and
sized to fit into cache on a modern 64 bit CPU while yours might be in
4D with O(3) spins and a fair bit of trig per update, or theirs might be
Metropolis and yours might be heat bath or cluster (sampling method)
with very different algorithms.

Just for a single example -- stream is very popular, but it omits division
and doesn't give you a good picture of memory access times in LESS than
ideal circumstances where, instead of streaming through a vector, one
bops all over memory.  There is a pretty big difference in both cases,
and yes, real code sometimes requires actual division, real code
(especially simulation code) sometimes cannot be made "cache local" and
sometimes cannot be made memory local at all.  Stream, or application
benchmarks LIKE stream, that are linear algebra, multiply/add heavy may
give you no clue as to how the system will respond if it is asked to
divide numbers pulled from all over memory, maybe with a bit of trig or
other transcendental calls mixed in.
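
The shape of such a test can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not a replacement for
stream or lmbench: the function names are hypothetical, and interpreter
overhead swamps most of the hardware effect, so a serious version would
be written in C.  It does one pass of divisions over an array in
sequential order and one pass over the same array in shuffled order,
and reports time per operation for each.

```python
import random
import time
from array import array


def time_divide_pass(data, order):
    """Time one pass that accumulates 1.0 / data[i], visiting i in 'order'."""
    start = time.perf_counter()
    total = 0.0
    for i in order:
        total += 1.0 / data[i]
    elapsed = time.perf_counter() - start
    return elapsed, total


def run(n=1 << 18, seed=0):
    """Compare sequential and shuffled traversal of the same n doubles."""
    rng = random.Random(seed)
    # Values in [1, 2) so the divisions are well behaved.
    data = array("d", (1.0 + rng.random() for _ in range(n)))
    sequential = list(range(n))
    scattered = sequential[:]
    rng.shuffle(scattered)
    t_seq, s_seq = time_divide_pass(data, sequential)
    t_rnd, s_rnd = time_divide_pass(data, scattered)
    return {
        "n": n,
        "seq_ns_per_op": 1e9 * t_seq / n,
        "rnd_ns_per_op": 1e9 * t_rnd / n,
        "checksums": (s_seq, s_rnd),  # same terms, different summation order
    }


if __name__ == "__main__":
    print(run())
```

To see cache boundaries one would repeat this over a range of n, from
well inside cache to well outside it, and plot the two curves.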

And did I mention the variations associated with the compiler?  Or sometimes
with BIOS configuration, memory, or operating system (the code almost
certainly contains system calls, if only for memory management)?

There is nothing wrong with SPEC as a measure of general purpose
performance, and things like HPCC have been engineered by smart people
who know all of this.  They are useful.  Applications (where you know
what they are) that will run on the system need no justification for use
as benchmarks.  However, if you need to be able to ESTIMATE upper or
lower performance bounds on code you haven't seen before, code that may
not even exist yet, microbenchmarks are very, very useful.

At least then one can say "Gee, I don't know specifically about your
code, but small-message latency on this system is X, and here is a graph
of latency/bandwidth as a function of packet size, this shows you how
long it takes to memcpy a block of memory, here are some curves that
show how it degrades for different block sizes (as they sweep across
cache boundaries) and for unfavorable access patterns, here is how fast it
generates the primary transcendentals for this particular compiler -- if
you know how much your code does these things, you can at least estimate
your code's performance on this hardware with this OS and compiler and
network."
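
That kind of back-of-envelope estimate can be sketched as follows.
This is a hedged illustration only: the operation classes, the counts,
and every per-operation cost are made-up placeholders standing in for
real microbenchmark results, and the model is a plain serial sum that
ignores any overlap between computation, memory access, and
communication.

```python
# Back-of-envelope runtime bounds from microbenchmark figures.
# Every number used in the example below is a placeholder, not a measurement.

def estimate_seconds(counts, cost_ns):
    """Sum count * per-operation cost (in ns) over all classes; return seconds."""
    missing = set(counts) - set(cost_ns)
    if missing:
        raise KeyError(f"no microbenchmark figure for: {sorted(missing)}")
    return sum(counts[op] * cost_ns[op] for op in counts) * 1e-9


def bounds(counts, best_ns, worst_ns):
    """Optimistic and pessimistic estimates from best and worst per-op costs."""
    return estimate_seconds(counts, best_ns), estimate_seconds(counts, worst_ns)


if __name__ == "__main__":
    # How often the (hypothetical) code does each thing per run.
    counts = {"small_message": 1_000, "divide": 2_000_000}
    # Placeholder costs in ns: in-cache and quiet vs. out-of-cache and busy.
    best_ns = {"small_message": 5_000, "divide": 10}
    worst_ns = {"small_message": 50_000, "divide": 40}
    low, high = bounds(counts, best_ns, worst_ns)
    print(f"estimated runtime between {low:.3f} s and {high:.3f} s")
```

The spread between the two numbers is the useful part: if it is
wide, that tells you which microbenchmark curve you need to look at
more carefully for your particular code.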

Perhaps fortunately (perhaps not) there is a lot less variation in
system performance with system design than there once was.  Everybody
uses one of a few CPUs, one of a few chipsets, generic memory,
standardized peripherals.  There can be small variations from system to
system, but in many cases one can get a pretty good idea of the
nonlinear "performance fingerprint" of a given CPU/OS/compiler family
(e.g. opteron/linux/gcc) all at once and have it not be crazy wrong or
unintelligible as you vary similar systems from different manufacturers
or vary clock speed within the family.  There are enough exceptions that
it isn't wise to TRUST this rule, but it is still likely correct within
10% or so.

     rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu