[Beowulf] Performance characterising a HPC application

Thu Mar 22 19:28:24 PDT 2007

> I would call that "the application".  seriously, a benchmark which 
> does what actual apps do, but is somehow not the app?  is there some
> reason to believe that there is not some sort of basis set of primitives
> which actual app performance can be factored into?

Mark, the fundamental problem is that benchmarks can be gamed. 2-node
benchmarks are especially sensitive to this: real apps use all the
cores on a node, involve lots of nodes, and don't talk to just one
partner.

Compare the latency numbers in HPC Challenge to the 2-node ping-pong
latency reported by vendors. For some vendors, it's the same number.
For others, the latency from using all the nodes is much, much higher.

Note that the new MVAPICH has message coalescing, which causes its
2-node streaming bandwidth and message rate to rise. But real apps
rarely have that message pattern -- instead, they send a single
message each to lots of other nodes before synchronizing. Message rate
benchmarks like "base" HPCC GUPS get no benefit from message
coalescing.
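
To make the coalescing point concrete, here is a toy model. It is not
MVAPICH's actual algorithm. It assumes that back-to-back messages to the
same destination can be merged into one wire transmission, up to a made-up
limit (max_coalesce), and counts the resulting transmissions for a 2-node
streaming pattern versus a one-message-per-partner pattern. The function
name and the limit of 8 are hypothetical.

```python
from itertools import groupby
from math import ceil


def wire_sends(destinations, max_coalesce=8):
    """Count wire transmissions for a sequence of message destinations.

    Toy assumption: consecutive messages to the same destination are
    merged, at most max_coalesce messages per transmission.
    """
    total = 0
    for _dest, run in groupby(destinations):
        run_length = sum(1 for _ in run)
        total += ceil(run_length / max_coalesce)
    return total


# 2-node streaming test: 64 messages, all sent to rank 1.
streaming = [1] * 64
# App-like pattern: one message each to 64 different ranks.
scatter = list(range(1, 65))

print(wire_sends(streaming))  # 8
print(wire_sends(scatter))    # 64
```

Under these assumptions the streaming test needs 8x fewer transmissions,
while the scatter pattern gets no reduction at all, which is why a 2-node
message rate number says little about the second case.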

What I meant was to create a benchmark which does the same data
transfer as the real app. For example, halo exchanges in 2D and 3D.
That's a lot closer to the actual app, and the scaling to large
clusters will be very revealing. (What it doesn't include is the
cache-busting effects of the real app, but you can add that in, too. See
Keith's paper from the last EuroPVM/MPI.)
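
As a rough sketch of the communication pattern such a benchmark would
exercise, the following works out which ranks a given rank exchanges halos
with on a periodic 2D or 3D process grid. Row-major rank ordering, periodic
boundaries, face neighbours only, and all function names are assumptions
made for illustration. In a real MPI code the Cartesian topology routines
(MPI_Cart_create, MPI_Cart_shift) would normally provide this, and the
actual sends, receives and synchronization are left out here.

```python
def rank_to_coords(rank, dims):
    """Convert a row-major rank to coordinates in a grid of shape dims."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def coords_to_rank(coords, dims):
    """Convert grid coordinates back to a row-major rank."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank


def halo_partners(rank, dims):
    """Return the face neighbours of rank on a periodic grid.

    Two partners per dimension (one step down, one step up). If a
    dimension has extent 1 or 2, the same rank can appear twice.
    """
    coords = rank_to_coords(rank, dims)
    partners = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coords)
            shifted[axis] = (coords[axis] + step) % extent
            partners.append(coords_to_rank(shifted, dims))
    return partners


print(halo_partners(0, (4, 4)))     # [12, 4, 3, 1]
print(halo_partners(0, (4, 4, 4)))  # [48, 16, 12, 4, 3, 1]
```

Each rank talks to 4 partners in 2D and 6 in 3D, most of them on other
nodes once the grid is large, which is the part a 2-node ping-pong never
sees.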

HPC Challenge is much better than what has come before, but it too can
be gamed.  Optimized GUPS doesn't mean anything anymore. PTRANS can be
"optimized" by arranging the nodes such that all communication is
intra-node. And guess what? HPCC results are hard to come by, even though
the suite is pretty easy to run.

Trust me, I'd love to see microbenchmarks which attack the real issues
that speed up applications. But usually they miss the mark, and my
attempt to create a new one (message rate) is now destroyed by message
coalescing. I should have used an N-node benchmark instead.

-- greg