[Beowulf] CPU Benchmark Result

Robert G. Brown rgb at phy.duke.edu
Wed Dec 8 09:50:57 PST 2004

On Wed, 8 Dec 2004, Rajiv wrote:

> Dear Sir,
> I would like to get CPU benchmark results for various architectures.
> Any good sites where I could find this information. I found the site
> http://www.unc.edu/atn/hpc/performance/ useful. But I am unable to
> access - I am asked for username and password. How can I access this
> site.

Here are at least some of the primary/famous benchmarks:

  SPEC -- probably the "best" of the application-level benchmark suites.
Fairly tight rules, but deep-pocketed vendors doubtless maintain an
edge beyond just having decent hardware.

  lmbench -- I think without question the best of the microbenchmark
suites.  If you want to find out how fast the CPU does any basic
operation, this is probably the first place to look.  This suite is
heavily used by the linux kernel developers including Linus Himself
because it provides accurate and reproducible timings of things like
interrupt handling rates, context switch rates, memory latency and
bandwidth, and some selected CPU operational rates.

  stream -- If you are interested in CPU-memory combined rates in
operations on streaming vectors (e.g. copy, add, multiply-add) stream is
the microbenchmark of choice.  Its one weakness is that it doesn't
provide one (easily) with a picture of rates as a function of vector
size, so that one cannot observe the variation as one increases the
vector size across the various CPU cache sizes.  It is therefore better
suited (as a predictor of application performance) for people running
large applications involving linear algebra than for people operating on
small blocks of data.  Oh, and another weakness is that it doesn't
provide any measure that includes the division operation.  This is
important because some code REQUIRES division in a streaming context or
otherwise, and division is often several times slower than
multiplication.

  linpack -- Another linear algebra type benchmark.  Not terribly
relevant at the application level any more, and a bit too complex to be
a microbenchmark -- IMHO this is a benchmark that could be retired
without anyone really missing it for practical reasons.  However, it
has been around a long time and there is a fair bit of data derived from
it.  When someone tells you how many "MFLOPS" a system has, they are
probably referring to Linpack MFLOPS.  Historically, this has been a
highly misleading predictor of relative systems performance at the
application level and has also proven relatively easy to "cheat" on a
bit at the hardware and software level, but there it is.

  savage -- This is a nearly forgotten benchmark that measures how fast
a system does transcendental function evaluations (e.g. sin, tan).
These are typically library calls, but some CPUs have had them built
into microcode so that they execute several times faster (typically)
than library code.  Libraries can also exhibit some variation depending
on the algorithms used for evaluation.
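
As a rough illustration, here is a minimal sketch of a savage-style loop
in Python.  The loop body and the count of 2499 passes are the form in
which the benchmark is usually quoted, and are an assumption here rather
than something taken from cpu_rate; the function names are made up for
the example.  Each pass should return its input plus one, so the exact
answer is 2500 and the sketch reports both time and accumulated error.

```python
import math
import time


def savage(iterations=2499):
    """Run the savage-style loop; the exact answer is iterations + 1."""
    a = 1.0
    for _ in range(iterations):
        # tan(atan(x)), exp(log(x)) and sqrt(x*x) each undo themselves,
        # so any drift from the exact answer is library rounding error.
        a = math.tan(math.atan(math.exp(math.log(math.sqrt(a * a))))) + 1.0
    return a


def time_savage(iterations=2499, repeats=20):
    """Return (best seconds per run, absolute error of the last run)."""
    best = float("inf")
    result = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        result = savage(iterations)
        best = min(best, time.perf_counter() - start)
    return best, abs(result - (iterations + 1))


if __name__ == "__main__":
    seconds, error = time_savage()
    print(f"best run: {seconds * 1e6:.1f} us, absolute error: {error:.3e}")
```

In an interpreted language the call overhead is a large part of the
measured time, so this shows the shape of the test rather than the raw
speed of the hardware or its math library.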

Some of these benchmarks are wrapped into one another.  For example, the
HPC Challenge suite will contain stream, and I recall that lmbench has
stream available in it as well now (don't shoot me if that is wrong --
I'm just remembering and could be mistaken).  My own benchmark wrapper,
cpu_rate (available on my website below under either General or Beowulf,
can't remember which) contains stream WITH a variable length vector
size, a stream-like measure of "bogomflops" (arithmetic mean of +-*/
times/rates), savage, and a memory read/write test that permits one to
shuffle the order of access to compare streaming with random access
rates.  It is still a bit buggy and is on my list for more work (along
with about four other projects:-) over Xmas break, but what it is really
designed to be is a shell for drop-in microbenchmarks of your own design
(arbitrary code fragments).  
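
The shuffled-access idea can be sketched in a few lines.  This is not
cpu_rate's code, just a hedged Python illustration with hypothetical
names (access_time, compare_access): the same elements are read once in
index order and once in a shuffled order, so any difference in time
comes from the access pattern alone.

```python
import random
import time


def access_time(data, order):
    """Seconds to read every element of data once, visiting indices in order."""
    total = 0
    start = time.perf_counter()
    for i in order:
        total += data[i]
    elapsed = time.perf_counter() - start
    return elapsed, total


def compare_access(n=1_000_000, seed=0):
    """Time the same n reads in sequential and in shuffled index order."""
    data = list(range(n))
    sequential = list(range(n))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    t_seq, sum_seq = access_time(data, sequential)
    t_rnd, sum_rnd = access_time(data, shuffled)
    assert sum_seq == sum_rnd  # both orders touch every element exactly once
    return t_seq, t_rnd


if __name__ == "__main__":
    t_seq, t_rnd = compare_access()
    print(f"sequential: {t_seq:.3f} s   shuffled: {t_rnd:.3f} s")
```

Interpreter overhead mutes the effect in Python; a compiled version of
the same loop over a vector much larger than cache would be expected to
show a far bigger gap between the two orders.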

Benchmarking whole applications is easy -- just use wall-clock time.
Benchmarking small code fragments is remarkably difficult, especially if
their execution time is comparable to the time required to read the most
accurate system clock available (typically the onboard CPU cycle
counter).  Benchmarking e.g. library calls is difficult to do completely
accurately, but you can get a decent idea from using the -p flag and
gmon (profiler).  There is a bit of Heisenberg uncertainty in all of
these -- the process of measurement can change the results, hopefully
not too much to be useful.
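
One common way around the clock problem, sketched here in Python under
stated assumptions, is to estimate the cost of reading the clock, run
the fragment many times between two clock reads, and subtract the time
of an otherwise identical loop around an empty call.  The names
clock_overhead and time_fragment are hypothetical, and
time.perf_counter stands in for a cycle counter.

```python
import math
import time


def clock_overhead(samples=100_000):
    """Rough upper bound on seconds per timer read (includes loop cost)."""
    start = time.perf_counter()
    for _ in range(samples):
        time.perf_counter()
    return (time.perf_counter() - start) / samples


def time_fragment(fragment, iterations=200_000, repeats=5):
    """Best-of-repeats seconds per call of fragment, minus empty-call cost."""

    def noop():
        pass

    def best_per_call(func):
        best = float("inf")
        for _rep in range(repeats):
            start = time.perf_counter()
            for _it in range(iterations):
                func()
            best = min(best, time.perf_counter() - start)
        return best / iterations

    # Subtracting the empty loop removes call and loop overhead; clamp at
    # zero because noise can make the difference slightly negative.
    return max(best_per_call(fragment) - best_per_call(noop), 0.0)


if __name__ == "__main__":
    print(f"timer read: about {clock_overhead() * 1e9:.0f} ns")
    per_call = time_fragment(lambda: math.sqrt(2.0))
    print(f"sqrt fragment: about {per_call * 1e9:.0f} ns per call")
```

If the per-call figure comes out comparable to the timer read cost, the
iteration count is doing the real work: a single timed call would mostly
be measuring the clock.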

I'm not providing URLs because all of the above can easily be found with
google, and because I don't know the exact URLs of lists of results
derived from the benchmarks anyway.  SPEC is pretty good about
publishing a result list per submitted architecture.  stream has started
to do this as well, although it is also (unfortunately) playing a
variant of the "Top X" Game where vendors get to tune and are "ranked".
lmbench has the strictest rules of them all -- no vendor tuning
whatsoever and you have to publish a whole SUITE of results if you
publish any one.  The more I look at and write about this stuff, the
more I appreciate what Larry (McVoy) is fighting against...


> Regards,
> Rajiv

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
