[Beowulf] Benchmark between Dell Poweredge 1950 And 1435

Thu Mar 8 11:25:29 PST 2007

Mark,
Thanks, that led me (with a bit of wandering) to e.g.
http://www.cs.virginia.edu/stream/top20/Balance.html.
My immediate concern is for an app that is worse than embarrassingly
parallel; it can't (currently) trade memory for time, and can't really use
any memory or network effectively, by the list's standards. Basically I want
a zillion CPUs and they can communicate by crayon on postcard. That's not
practical, but my initial valuator is just GHz/$.
I care about the memory sharing and message passing efficiency issues only
in that I want to smarten up my app to take advantage of other economies.
Peter

On 3/8/07, Mark Hahn <hahn at mcmaster.ca> wrote:
>
> > Great thanks. That was clear and the takeaway is that I should pay attention
> > to the number of memory channels per core (which may be less than 1.0)
>
> I think the takeaway is a bit more acute: if your code is cache-friendly,
> simply pay attention to cores * clock * flops/cycle.
>
> otherwise (ie, when your models are large), pay attention to the "balance"
> between observed memory bandwidth and peak flops.
>
> the stream benchmark is a great way to do this, and has traditionally
> promulgated the "balance" argument.  here's an example:
>
> http://www.cs.virginia.edu/stream/stream_mail/2007/0001.html
>
> basically, 13 GB/s for a 2x2 opteron/2.8 system (peak flops would
> be 2*2*2*2.8=22.4 GFlops), so you need 1.7 flops per byte to be happy.
>
> I don't have a report handy for core2, but iirc, people report hitting
> a wall of around 9 GB/s for any dual-FSB core2 system.  assuming dual-core
> parts like the paper, peak theoretical flops is 37 GFlops, for a balance
> of just over 4.  that ratio should really be called "imbalance" ;)
> quad-core would be worse, of course.
>
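
For anyone who wants to redo the balance arithmetic quoted above, here is a minimal Python sketch. The function and variable names are made up for illustration, and the reading of 2*2*2*2.8 as sockets * cores per socket * flops per cycle * GHz is my interpretation of Mark's "cores * clock * flops/cycle". The bandwidth and peak figures are the ones quoted in the thread, not new measurements.

```python
def peak_gflops(sockets, cores_per_socket, flops_per_cycle, clock_ghz):
    """Theoretical peak in GFlops: cores * clock * flops/cycle."""
    return sockets * cores_per_socket * flops_per_cycle * clock_ghz


def balance(peak, stream_gb_per_s):
    """Flops the code must perform per byte of memory traffic to keep
    the cores busy; higher means a more memory-starved machine."""
    return peak / stream_gb_per_s


# 2x2 Opteron at 2.8 GHz, 13 GB/s observed (figures quoted above).
opteron_peak = peak_gflops(2, 2, 2, 2.8)
opteron_balance = balance(opteron_peak, 13.0)

# Dual-FSB Core2: 37 GFlops peak and roughly 9 GB/s, as quoted above.
core2_balance = balance(37.0, 9.0)

print(f"opteron: {opteron_peak:.1f} GFlops peak, balance {opteron_balance:.1f}")
print(f"core2:   37.0 GFlops peak, balance {core2_balance:.1f}")
```

Swapping in your own STREAM result and core count gives the same ratio for any other box you are pricing.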