[Beowulf] Benchmark between Dell Poweredge 1950 And 1435

Thu Mar 8 10:26:30 PST 2007

> Great thanks. That was clear and the takeaway is that I should pay attention
> to the number of memory channels per core (which may be less than 1.0)

I think the takeaway is a bit more acute: if your code is cache-friendly,
simply pay attention to cores * clock * flops/cycle.

otherwise (i.e., when your models are too large to fit in cache), pay
attention to the "balance" between observed memory bandwidth and peak flops.

the stream benchmark is a great way to do this, and has traditionally
promulgated the "balance" argument.  here's an example:

http://www.cs.virginia.edu/stream/stream_mail/2007/0001.html

basically, 13 GB/s for a 2x2 opteron/2.8 system. peak flops would
be 2*2*2*2.8=22.4 GFlops, so you need about 1.7 flops per byte to be happy.

I don't have a report handy for core2, but iirc, people report hitting
a wall of around 9 GB/s for any dual-FSB core2 system.  assuming dual-core
parts like those in the paper, peak theoretical flops is 37 GFlops, for a
balance of just over 4 flops per byte.  that ratio should really be called
"imbalance" ;)
quad-core would be worse, of course.
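
for anyone who wants to redo the arithmetic, here is a rough sketch in
python. the opteron numbers (2 sockets, 2 cores, 2 flops/cycle, 2.8 GHz,
13 GB/s) come straight from the figures above. the core2 inputs are
assumptions chosen to reproduce the 37 GFlops figure: 4 flops/cycle and a
2.33 GHz clock are not stated in this thread. the quad-core line further
assumes the same clock and the same 9 GB/s wall. the function names are
made up for illustration.

```python
def peak_gflops(sockets, cores_per_socket, flops_per_cycle, clock_ghz):
    """Theoretical peak: sockets * cores * flops/cycle * clock (GHz)."""
    return sockets * cores_per_socket * flops_per_cycle * clock_ghz


def balance(peak, stream_gb_per_s):
    """Flops available per byte of observed memory bandwidth.

    Lower is better: code needs at least this many flops per byte
    moved to keep the cores busy.
    """
    return peak / stream_gb_per_s


# 2x2 opteron/2.8, 13 GB/s observed by stream (figures from the post)
opteron_peak = peak_gflops(2, 2, 2, 2.8)
opteron_balance = balance(opteron_peak, 13.0)

# dual-socket dual-core core2 (ASSUMED: 4 flops/cycle, 2.33 GHz),
# with the ~9 GB/s wall people report for dual-FSB systems
core2_peak = peak_gflops(2, 2, 4, 2.33)
core2_balance = balance(core2_peak, 9.0)

# quad-core variant (ASSUMED: same clock, same 9 GB/s wall)
quad_peak = peak_gflops(2, 4, 4, 2.33)
quad_balance = balance(quad_peak, 9.0)

for name, peak, bal in [
    ("opteron 2x2", opteron_peak, opteron_balance),
    ("core2 2x2", core2_peak, core2_balance),
    ("core2 2x4", quad_peak, quad_balance),
]:
    print(f"{name}: {peak:.1f} GFlops peak, {bal:.1f} flops/byte")
```

under those assumptions the quad-core balance comes out at twice the
dual-core figure, since peak flops double while the bandwidth stays put.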