[Beowulf] Performance characterising a HPC application

Patrick Geoffray patrick at myri.com
Fri Mar 23 00:23:32 PDT 2007


Greg,

Greg Lindahl wrote:
> Compare the latency numbers in HPC Challenge to the 2-node ping-pong
> latency reported by vendors. For some vendors, it's the same number.
> For others, the latency from using all the nodes is much, much higher.

The ring test in HPCC is rather poorly implemented: only 3 iterations to 
measure something on the same order of magnitude as the precision of 
MPI_Wtime(). Someone just failed Benchmarking 101. If you replace a 
gettimeofday()-based implementation of MPI_Wtime() with a cycle-counter 
one, the numbers change quite a bit.
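
To give an idea of the gap, here is a minimal sketch (plain C, x86-only 
rdtsc, my own illustration, not the HPCC source) that measures the 
smallest observable tick of each clock:

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

/* x86 cycle counter; assumes a constant-rate TSC */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static double now_gtod(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    /* smallest increment gettimeofday() can observe: ~1 us */
    double t0 = now_gtod(), t1;
    do { t1 = now_gtod(); } while (t1 == t0);
    printf("gettimeofday() tick: %g us\n", (t1 - t0) * 1e6);

    /* back-to-back cycle counter reads: a few nanoseconds */
    uint64_t c0 = rdtsc();
    uint64_t c1 = rdtsc();
    printf("rdtsc delta: %llu cycles\n", (unsigned long long)(c1 - c0));
    return 0;
}

With a microsecond tick and a 3-iteration loop, the measurement noise is 
the same size as the latency you are trying to measure.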

However, I agree with you: this is the right way to measure the 
sensitivity to concurrent traffic.
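
Something along these lines, assuming plain MPI (a sketch of the idea, 
not the HPCC code), with an iteration count that actually averages out 
the timer noise:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i, iters = 10000;
    char sbuf = 0, rbuf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        /* every rank forwards a small message around the ring,
           so all NICs are loaded at the same time */
        MPI_Sendrecv(&sbuf, 1, MPI_CHAR, next, 0,
                     &rbuf, 1, MPI_CHAR, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t = MPI_Wtime() - t0;
    if (rank == 0)
        printf("per-hop latency under load: %g us\n", t / iters * 1e6);

    MPI_Finalize();
    return 0;
}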

> Note that the new MVAPICH has message coalescing, which causes its

It is unbelievable that so few people denounce it. It is clearly 
implemented only to cheat on a micro-benchmark. What's next? Checking 
that the buffer to send is identical to the previous one, to avoid 
sending "redundant" messages in ping-pong?!?
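
For the record, the trick looks roughly like this (a hypothetical 
sketch, nothing to do with the actual MVAPICH source):

#include <stdio.h>
#include <string.h>

#define COALESCE_MAX 4096

struct dest_queue {
    char   staging[COALESCE_MAX];
    size_t used;
};

/* stand-in for the real NIC injection path */
static void wire_send(const void *pkt, size_t len)
{
    printf("one packet on the wire: %zu bytes\n", len);
}

static void flush_queue(struct dest_queue *q)
{
    if (q->used) {
        wire_send(q->staging, q->used);
        q->used = 0;
    }
}

/* the "send" is usually just a memcpy, no network traffic at all */
static void coalesced_send(struct dest_queue *q, const void *msg, size_t len)
{
    if (q->used + len > COALESCE_MAX)
        flush_queue(q);
    memcpy(q->staging + q->used, msg, len);
    q->used += len;
}

int main(void)
{
    struct dest_queue q = { .used = 0 };
    char msg[8] = { 0 };

    /* 512 "messages" counted by the benchmark, one actual packet */
    for (int i = 0; i < 512; i++)
        coalesced_send(&q, msg, sizeof msg);
    flush_queue(&q);
    return 0;
}

The benchmark counts 512 sends; the NIC saw one packet.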

> message each to lots of other nodes before synchronizing. Message rate
> benchmarks like "base" HPCC Gups get no benefit from message
> coalescing.

HPCC Gups already does some sort of coalescing: if updates are going to 
the same process, they are put in the same bucket. The size of the 
messages depends on the number of updates in the buckets, so a smaller 
number of nodes means bigger messages. I don't understand why they would 
do that; it defeats the goal of scalability testing.
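
Schematically, it works like this (my own illustration, not the HPCC 
source): updates are generated in chunks, binned by destination rank, 
and each non-empty bucket goes out as one message, so the message size 
scales inversely with the number of ranks:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NRANKS  16       /* hypothetical process count */
#define CHUNK   1024     /* updates generated between flushes */
#define NCHUNKS 64

int main(void)
{
    int  counts[NRANKS];
    long msgs = 0, updates = 0;

    for (int c = 0; c < NCHUNKS; c++) {
        memset(counts, 0, sizeof counts);
        for (int i = 0; i < CHUNK; i++) {
            int dest = rand() % NRANKS;   /* owner of the table entry */
            counts[dest]++;               /* bucket per destination */
        }
        for (int r = 0; r < NRANKS; r++) {
            if (counts[r]) {              /* one message per bucket */
                msgs++;
                updates += counts[r];
            }
        }
    }
    /* average message size ~ CHUNK / NRANKS: halve the ranks,
       double the message size */
    printf("%ld messages, %.1f updates each on average\n",
           msgs, (double)updates / msgs);
    return 0;
}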

> HPC Challenge is much better than what has come before, but it too can

I think HPCC is somewhat of a regression compared to the NAS benchmarks, 
for example. The communication benchmarks are too analytic, not 
functional enough.

> intra-node. And guess what? HPCC results are hard to come by, even though
> it's pretty easy to run.

And HPCC is a pain in the bottom to compile and run. HPL is not really a 
shining example of a straightforward build process and config-less 
operation, so why build HPCC on top of it? Is autoconf still too 
bleeding edge these days? Argh! And what about the three dozen 
parameters in the config file?!? It's just insane.

I like the NAS benchmarks. You can run each of them independently and 
only choose the problem size and the number of processes. Easy to run, 
easy to compare. Pallas is nice too; anybody can run it.

> Trust me, I'd love to see microbenchmarks which attack the real issues
> that speed up applications. But usually they miss the mark, and my
> attempt to create a new one (message rate) is now destroyed by message
> coalescing. I should have used an N-node benchmark instead.

If you want to show the impact of concurrent communications, something 
latency-based like the HPCC ring test is the best way (possibly with 
more nodes). The millions of packets per second of a stream-based 
benchmark are lovely for the marketing folks, but they have little 
meaning for real codes that do at least some computation. However, an 
alltoall on many cores/nodes would exercise the same metric (many 
sends/recvs on the same NIC at the same time), but it would be harder to 
cheat and much more meaningful IMHO.
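
Something as dumb as this would do, assuming plain MPI (a sketch, not an 
existing benchmark):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, iters = 1000, msg = 8;   /* 8 bytes per peer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sbuf = malloc((size_t)size * msg);
    char *rbuf = malloc((size_t)size * msg);
    memset(sbuf, 0, (size_t)size * msg);

    /* one warm-up, then time: every NIC handles size-1 concurrent
       sends and recvs per iteration, all to distinct destinations,
       so per-destination coalescing buys nothing */
    MPI_Alltoall(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR,
                     MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / iters;
    if (rank == 0)
        printf("alltoall, %d ranks, %d B/peer: %g us\n", size, msg, t * 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}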

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


