[Beowulf] Cluster Benchmarks?

Robert G. Brown rgb at phy.duke.edu
Mon Jun 14 12:02:09 PDT 2004

On Mon, 14 Jun 2004, Jonathan Michael Nowacki wrote:

> Does anyone know of any impartial benchmark websites?  Something that
> would compare the Xserve vs. Opteron vs. Athlon vs. P4 for scientific
> computing?
> I found this website, but it's internal use only.  Too bad.
> http://www.unc.edu/atn/asg/benchmark/benchmark_2003.html

Curiously enough, I'm working actively on cpu_rate, an impartial
benchmark.  cpu_rate is a fully GPL v2, open source benchmark that I'm
rewriting in an "object oriented" design where each test is in a more or
less standard wrapper with well-defined init, alloc, free, test and
results routines.  The rest of the code is a reusable, consistent timing
shell that automagically computes how hard it needs to work to get good
precision with a high precision timer (usually the cpu clock itself, but
on an Xserve you might have to use gettimeofday -- the "automagic" part
means that if you do this you'll still get consistent precision but the
inner loops will likely have to run longer to get it).  The timing
harness runs a selectable number of iterations of the entire timing
process (default 100) and returns both the mean timing and the standard
deviation, in nanoseconds, of the selected operation.

This is in the general category of "microbenchmark" -- the code in the
tested segment of the testing routines is typically a small code
fragment or even an atomic operation.  However, the harness can manage
vector streams with a single index, and there is even a trick one can
use to automatically test at least memory access both for streaming
(sequential) access and for random/shuffled access, which can be a very
illuminating test.

Tests that it will run at the moment include:

rgb at lilith|B:1112>cpu_rate -l
 #   Name                     Remark
 0  Null test           Test validation loop, should take "no time" (infinite rate)
 1  bogomflops          d[i] = (ad + d[i])*(bd - d[i])/d[i] (8 byte double vector)
 2  bogomtrids          d[i] = (ad + bd - cd)*d[i] (8 byte double vector)
 3  stream copy         d[i] = a[i] (8 byte double vector)
 4  stream scale        d[i] = xtest*d[i] (8 byte double vector)
 5  stream add          d[i] = a[i] + b[i] (8 byte double vector)
 6  stream triad        d[i] = a[i] + xtest*b[i] (8 byte double vector)
 7  memory read/write   Reads, then writes (4 byte integer vector)
 8  memory read         Reads (4 byte integer vector)
 9  memory write        Writes (4 byte integer vector)
10  savage              xtest = tan(atan(exp(log(sqrt(xtest*xtest)))))

Note that it contains the four stream tests, sort of.  I say sort of
because although it tests the stream operations, it uses strictly
malloc'd memory for the vectors and consequently has one more (pointer)
layer of indirection than stream, which allocates the vectors in the
data segment (no dynamic pointers).  The stream results are typically a
tiny bit slower than "stream" per se, but I think they are more useful
as you can observe how stream results vary with vector size as one
sweeps across e.g. various cache sizes and strides.  

I'm also going to experiment a bit to see if I can have a hard allocated
variant of stream independent of a malloc'd version, and use some clever
indirection to avoid malloc'ing memory for the latter until one exceeds
the hard allocated data space.  The "cost" of this will be that the code
will have a rather large default data size even if one is running (say)
savage, which requires no memory at all to speak of.  As noted, the
direct memory tests can use shuffled or sequential access with
dramatically different rates, as one would expect, running out of main
memory for vectors larger than cache.

Alas, as I write this I >>AM<< working on it, and am about 2/3 of the
way through eliminating what I hope/expect is the last pernicious memory
leak.  With luck, it will take me only another hour or two to get the
code to where it runs perfectly for the last three tests and a bit
longer to run the full suite of tests on Celeron, P4, AMD boxen just to
be sure it works there still.  By (maybe) five or six pm EST I'll likely
have the new image up, in what I'd call late alpha or beta mode.

The whole point of the rewrite is that this suite SHOULD be very easy to
add your own code fragment to for testing purposes.  Copy any of the
existing tests (mflops.c and mflops.h, say) to mynewtest.c and
mynewtest.h.  Add a couple of lines to tests.h (one to an enum list, one
include line for mynewtest.h).  Add a line to cpu_rate_startup.c to call
the initialization routine.  Edit mynewtest.c in pretty obvious ways,
documented (I hope) in the comments, then compile and run.  It should
spit out nanosecond timings of said operation(s), with standard
deviation and a "bogomega" rate of same (millions of operations per
second).

Eventually it should be straightforward to instrument lots of
microscopic operations and subroutine/library calls.  I wrote this
originally (back in the 80's, MUCH more crudely) to try to answer the
simple question "how fast can this system do a floating point
operation", a thing that vendor estimates and most benchmarks of the day
never answered to anything like my satisfaction.

You can find it either on my personal web pages under Beowulf or on the
brahma site under resources, but you might wait until I announce the new
revision "formally" hopefully later today, along with the URL(s), to
retrieve it.

Hope it helps.


> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
