[Beowulf] Barcelona numbers

richard.walsh at comcast.net
Mon Sep 10 21:32:37 PDT 2007

Greg Lindahl <lindahl at pbm.com> wrote:

> On Tue, Sep 11, 2007 at 03:32:29AM +0000, richard.walsh at comcast.net wrote:
> > So with all 8 cores at work from 2 sockets you are seeing 70% of peak assuming
> > you are using 667 MHz DDR2
> I'm not sure what this word "peak" means -- it cannot be achieved by
> any test under any circumstances, whereas for a processor floating
> point peak, you can usually do it with some weird code.
> Much better to call the measured STREAM number the actual peak; then I
> won't have to let loose the memory bandwidth peak bot complainer once
> again ;-)

Yes, yes ... ;-) ... I like the STREAM numbers too, but I also like to know how much of
the advertised capacity of the memory bus one can get.  We know what both AMD
and Intel say their systems can deliver, and we measure the value of each design
by comparing the percentage of the advertised capacity they deliver on a
benchmark of interest.  In the case of memory bandwidth it is revealing, I think ... Oui?
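
The percentage I have in mind is just arithmetic. A rough sketch follows. It
assumes two 64-bit DDR2 channels per socket, which is my assumption and not
something stated above, and the measured figure is a made-up placeholder, not
Bill's result:

```python
# Rough sketch: share of the advertised memory bandwidth that a benchmark delivers.
# Assumptions (not from this thread): 2 DDR2 channels per socket, 8 bytes per
# transfer per channel. The "measured" value below is hypothetical.

def advertised_bandwidth_gbs(transfers_mt_s, bytes_per_transfer=8,
                             channels_per_socket=2, sockets=2):
    """Advertised bus bandwidth in GB/s (10^9 bytes/s) for the whole node."""
    bytes_per_second = (transfers_mt_s * 1e6 * bytes_per_transfer
                        * channels_per_socket * sockets)
    return bytes_per_second / 1e9


def percent_of_advertised(measured_gbs, advertised_gbs):
    """Measured bandwidth as a percentage of the advertised figure."""
    return 100.0 * measured_gbs / advertised_gbs


peak = advertised_bandwidth_gbs(667)   # DDR2-667 on 2 sockets
measured = 15.0                        # hypothetical STREAM number, GB/s
print(f"advertised: {peak:.2f} GB/s")
print(f"delivered:  {percent_of_advertised(measured, peak):.0f}% of advertised")
```

Swap in the real STREAM number and the real channel count for the box in
question; the point is only that the ratio is cheap to compute and easy to
compare across designs.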

On the other hand, if peak is a dirty word I will refrain from using it in polite company
... ;-) ...

> > I thought first byte latencies were around 65 nanos for Opteron.  Am
> > I confused?  
> You're misremembering. Opteron latency was always a function of the
> number of active sockets, and it is usually measured with only one
> core active, while Bill is doing the more realistic thing of having
> all the cores active. Run the same code on your favorite Intel if you
> want to compare.

Granted, latency measures depend on the nearness of the memory
referenced (a la cc-NUMA) to the location of the thread and the number
of threads that are active, but I thought Bill's 1 thread results were also
quite a bit larger than expected.  Maybe I need to look at the Intel numbers
for Bill's test again too.  Perhaps I was comparing Intel's ideal numbers to Bill's
real world AMD ones and that is what I was "misremembering." Do you expect
the best case first byte latencies for a single-core run referring to cc-NUMA-local
memory on the Barcelona to roughly equal (within 5-10%) those of dual-core
socket 1207 and/or socket 940 ... this is what I was thinking initially, but
perhaps Bill's result and the fact that there is an L3 cache to consider changes




