Cluster benchmark(s)?

Wed Jan 17 15:29:23 PST 2001

On Wed, 17 Jan 2001 Randy_Howard at Dell.com wrote:
>> Well, my intent was not to establish specific numbers but rather to
>> get an idea of "bang for the buck" factors with various hardware
>> configurations.  For example, I wonder if there is a way of
>> predicting up front for a given application whether or not 10/100
>> ethernet would be sufficient and not become the primary bottleneck.
>> I understand it is a very complex problem and this may not even
>> be possible.

On Wed, Jan 17, 2001 at 04:19:01PM -0500, Robert G. Brown wrote:
> Oh, it's possible all right but it isn't easy.  As you say, it's
> (fundamentally!) a complex problem, so you have to learn to understand
> and manage the complexity.  A general methodology might be outlined
> something like:
...
<snip very valid, but necessarily lengthy procedure>

Hmm. This is all very valid and correct, but unfortunately quite
overwhelming, particularly if you want to build your first
cluster. I'm wondering whether it would be possible
to establish a database of cluster benchmarks that could provide
hints (they won't be more than hints, but they could nevertheless
be helpful).

Here is the idea: there should be benchmarks and speedup data for different
types of cluster applications:
1. Embarrassingly parallel (e.g., Monte-Carlo simulations).
   In this case the benchmark will be dominated by the CPUs, the
   interconnect is unimportant, and the speedup curve will show linear
   scaling for an (almost) unlimited number of processors.
2. Applications with "nearest neighbour" communications
   (e.g., finite-difference methods for PDEs). In this case there is
   significant communication between processors. However, since the
   communication is local (i.e., processor n only talks with n+1 and
   n-1), the scaling of the communication time with the # of processors
   is not so bad (constant + probably a small linear piece).
   In this case you should see a maximum in the speedup curve, the location
   of which depends on your interconnect.
3. Applications with pairwise (all-to-all) communications
   (e.g., parallel FFT). In this case the time for communication scales
   in proportion to the square of the # of processors. The benchmark will
   be dominated by the speed of the interconnect, i.e., the speedup curve
   will show minimal speedups (or even speedups < 1) for fast ethernet.
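
To make the three cases a bit more concrete, here is a rough Python sketch
of the speedup curves they imply. It is a toy model, not a measurement:
the single-processor run time t1 and the communication constants c, d and a
are names and values I made up for illustration, and the real numbers
would have to come from benchmarks on the hardware in question.

```python
def speedup_embarrassing(n):
    """Case 1: no communication, T(n) = t1/n, so the speedup is simply n."""
    return float(n)


def speedup_nearest_neighbour(n, t1, c, d):
    """Case 2: T(n) = t1/n + c + d*n (constant plus a small linear piece).

    t1, c and d are hypothetical model parameters in seconds.
    """
    if n == 1:
        return 1.0  # a single processor does not communicate
    return t1 / (t1 / n + c + d * n)


def speedup_all_to_all(n, t1, a):
    """Case 3: T(n) = t1/n + a*n**2 (communication grows as n squared).

    t1 and a are hypothetical model parameters in seconds.
    """
    if n == 1:
        return 1.0  # a single processor does not communicate
    return t1 / (t1 / n + a * n ** 2)


def best_n(speedup, n_max=1024):
    """Return the processor count in 1..n_max with the largest speedup."""
    return max(range(1, n_max + 1), key=speedup)


if __name__ == "__main__":
    t1 = 100.0  # assumed single-processor run time in seconds
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        print(
            n,
            speedup_embarrassing(n),
            round(speedup_nearest_neighbour(n, t1, c=0.1, d=0.01), 2),
            round(speedup_all_to_all(n, t1, a=0.001), 2),
        )
```

In this model the case 2 curve peaks near sqrt(t1/d) processors and the
case 3 curve near (t1/(2a))^(1/3), so a slower interconnect (larger d or a)
moves the maximum to fewer processors. With a large enough a, the case 3
speedup on two processors already drops below 1.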

There may be a few more cases (but probably not many more).
A real application will be a mixture of these three scenarios. But if you
know how, e.g., a PIII/800MHz cluster with fast ethernet scales in these
cases, you at least have some hints as to how your own application may
scale on certain architectures.

Sure, there are complications: the results depend on the MPI distribution
used. E.g., LAM works best when small latencies are required, MPI/Pro is
good when high throughput is required, etc.

But nevertheless, I'm sure something like this would have helped me when
I set up my first cluster.

Comments?

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================