[Beowulf] Benchmark between Dell Poweredge 1950 And 1435
Mark Hahn
hahn at mcmaster.ca
Thu Mar 8 10:26:30 PST 2007
> Great thanks. That was clear and the takeaway is that I should pay attention
> to the number of memory channels per core (which may be less than 1.0)
I think the takeaway is a bit more acute: if your code is cache-friendly,
simply pay attention to cores * clock * flops/cycle.
otherwise (i.e., when your models are too large to fit in cache), pay attention
to the "balance" between peak flops and observed memory bandwidth.
the stream benchmark is a great way to do this, and has traditionally
promulgated the "balance" argument. here's an example:
http://www.cs.virginia.edu/stream/stream_mail/2007/0001.html
basically, 13 GB/s for a 2x2 opteron/2.8 system (peak flops would
be 2*2*2*2.8 = 22.4 GFlops), so you need about 1.7 flops per byte to be happy.
I don't have a report handy for core2, but iirc, people report hitting
a wall of around 9 GB/s for any dual-FSB core2 system. assuming dual-core
parts like those in the paper, peak theoretical flops is 37 GFlops, for a balance
of just over 4. that ratio should really be called "imbalance" ;)
quad-core would be worse, of course.
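
for anyone who wants to redo the arithmetic on their own hardware, here is a minimal sketch in Python. the function names are made up for illustration, the 2 flops/cycle figure for the opteron is simply the factor implied by the 2*2*2*2.8 product above, and the bandwidth numbers are the ones quoted in this message, not new measurements:

```python
def peak_gflops(sockets, cores_per_socket, flops_per_cycle, clock_ghz):
    """Theoretical peak in GFlops: cores * clock * flops/cycle."""
    return sockets * cores_per_socket * flops_per_cycle * clock_ghz


def balance(peak, stream_gb_per_s):
    """Flops the code must perform per byte of memory traffic to stay busy."""
    return peak / stream_gb_per_s


# 2x2 opteron/2.8, 13 GB/s observed with stream (figures from this message)
opteron_peak = peak_gflops(2, 2, 2, 2.8)
opteron_balance = balance(opteron_peak, 13.0)

# dual-FSB core2: 37 GFlops peak and ~9 GB/s, taken as quoted above
core2_balance = balance(37.0, 9.0)

print(f"opteron: {opteron_peak:.1f} GFlops peak, {opteron_balance:.2f} flops/byte")
print(f"core2:   37.0 GFlops peak, {core2_balance:.2f} flops/byte")
```

plug in your own stream result and core count; the higher the number, the more the machine depends on cache-friendly code.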