[Beowulf] MPI application benchmarks
Toon Knapen
toon.knapen at fft.be
Mon May 7 23:38:34 PDT 2007
Robert G. Brown wrote:
> Perhaps fortunately (perhaps not) there is a lot less variation in
> system performance with system design than there once was. Everybody
> uses one of a few CPUs, one of a few chipsets, generic memory,
> standardized peripherals. There can be small variations from system to
> system, but in many cases one can get a pretty good idea of the
> nonlinear "performance fingerprint" of a given CPU/OS/compiler family
> (e.g. opteron/linux/gcc) all at once and have it not be crazy wrong or
> unintelligible as you vary similar systems from different manufacturers
> or vary clock speed within the family. There are enough exceptions that
> it isn't wise to TRUST this rule, but it is still likely correct within
> 10% or so.
I agree that this rule is true for almost all codes ... that fit
perfectly in cache and that do not try to benefit from machine-specific
optimisations.
HPC codes however are always pushing the limits, and this means you will
always stumble on some bottleneck somewhere. Once you have removed that
bottleneck, you stumble on another. And every bottleneck masks all the
others until you remove it.
E.g. it was already mentioned in this thread that one should not forget
to pay attention to storage. Yet people often run parallel codes in
which each process performs heavy I/O, without a storage system adapted
to that load.
Or another example: GotoBLAS is well known to outperform netlib-blas.
However, in an application calling many dgemm's on small matrices (up to
50x50), netlib-blas will _really_ (i.e. by a factor of 30) outperform
GotoBLAS, because GotoBLAS 'loses' time aligning the matrices etc.,
which becomes significant for small matrices.
toon