[Beowulf] MPI application benchmarks

Mon May 7 23:38:34 PDT 2007

Robert G. Brown wrote:
> Perhaps fortunately (perhaps not) there is a lot less variation in
> system performance with system design than there once was.  Everybody
> uses one of a few CPUs, one of a few chipsets, generic memory,
> standardized peripherals.  There can be small variations from system to
> system, but in many cases one can get a pretty good idea of the
> nonlinear "performance fingerprint" of a given CPU/OS/compiler family
> (e.g. opteron/linux/gcc) all at once and have it not be crazy wrong or
> unintelligible as you vary similar systems from different manufacturers
> or vary clock speed within the family.  There are enough exceptions that
> it isn't wise to TRUST this rule, but it is still likely correct within
> 10% or so.

I agree that this rule is true for almost all codes ... that are 
perfectly in cache and that do not try to benefit from specific 
optimisations.

HPC codes, however, are always pushing the limits, and this means you will 
always stumble on some bottleneck somewhere. Once you have removed that 
bottleneck, you stumble on another. And every bottleneck masks all the 
others until you remove it.
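
To make the masking effect concrete, here is a minimal sketch. It assumes the run is bounded by its slowest stage; the stage names and rates are hypothetical numbers picked for illustration, not measurements from any system.

```python
# Hypothetical per-stage throughput limits (arbitrary units), not measurements.
LIMITS = {"storage": 40.0, "interconnect": 90.0, "memory": 150.0, "cpu": 400.0}


def bottleneck(limits):
    """Return (name, rate) of the slowest stage, which bounds the whole run."""
    name = min(limits, key=limits.get)
    return name, limits[name]


def tune_sequence(limits):
    """Repeatedly 'remove' the current bottleneck and record what shows up next."""
    remaining = dict(limits)
    order = []
    while remaining:
        name, rate = bottleneck(remaining)
        order.append((name, rate))
        # Assumption: once fixed, this stage no longer limits anything.
        del remaining[name]
    return order


for name, rate in tune_sequence(LIMITS):
    print(f"limited by {name:12s} at {rate:6.1f}")
```

In this toy model the interconnect limit cannot be seen at all until the storage limit is gone, and so on down the list.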

For example, it was already mentioned in this thread that one should not 
forget to pay attention to storage. Yet people often run parallel codes 
in which each process performs heavy I/O, without a storage system 
adapted to that load.

Another example: GotoBLAS is well known to outperform netlib-blas. 
However, in an application calling many dgemm's on small matrices (up to 
50x50), netlib-blas will _really_ (i.e. by a factor of 30) outperform 
GotoBLAS, because GotoBLAS 'loses' time aligning the matrices etc., which 
becomes significant for small matrices.
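
The trade-off can be sketched with a simple cost model: time per call = fixed setup cost + flops / rate, with 2*n^3 flops for an n x n dgemm. The overhead and rate values below are invented for illustration. They are not measurements of GotoBLAS or netlib-blas and are not meant to reproduce the factor of 30. They only show why a routine with a faster kernel can still lose on small matrices.

```python
def call_time(n, overhead, flops_per_sec):
    """Modelled time in seconds for one n x n dgemm: setup cost + 2*n**3 flops."""
    return overhead + 2.0 * n**3 / flops_per_sec


# Hypothetical parameters, chosen only to show the shape of the trade-off.
SIMPLE = dict(overhead=0.0, flops_per_sec=1.0e9)     # no setup cost, slow kernel
TUNED = dict(overhead=50.0e-6, flops_per_sec=8.0e9)  # setup cost, fast kernel


def crossover(simple, tuned, n_max=5000):
    """Smallest n at which the tuned routine is faster in this model, or None."""
    for n in range(1, n_max + 1):
        if call_time(n, **tuned) < call_time(n, **simple):
            return n
    return None


for n in (10, 50, 500):
    ratio = call_time(n, **TUNED) / call_time(n, **SIMPLE)
    print(f"n={n:4d}  tuned/simple time ratio = {ratio:.2f}")
print("crossover at n =", crossover(SIMPLE, TUNED))
```

Below the crossover the fixed cost dominates and the simple routine wins. Where that crossover sits on a real machine has to be measured with the application's actual matrix sizes.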

toon