[Beowulf] Barcelona numbers

Bill Broadley bill at cse.ucdavis.edu
Tue Sep 11 00:09:08 PDT 2007

Latency for an amd64-2.0:
L2 latency  400KB =  8ns (includes L1 latency)
Main memory 16MB  = 63ns (55ns because of memory)

Opteron 275 dual socket:
L2 latency 800KB = 8ns
Main memory      = 77ns (82ns because of L2)

I believe 77ns is something along the lines of 55ns for memory, 8ns for L2
latency, 2ns for registered memory (1 cycle @ 400 MHz), and 12ns or so for
hypertransport coherency.  55+8+2+12 = 77

Opteron 2350:
L2 latency 400KB = 7.5ns
L3 latency 2.25MB = 23ns (includes L2 latency)
Main memory 32MB  = 100ns

I attribute the earlier 136ns number for 2GB to TLB thrashing, which is
hopefully fixable with 1GB pages.  I believe the 100ns is something along the
lines of 63ns for memory, 23ns for L3, 2ns for registered memory, and 12ns or
so for hypertransport coherency.  Sadly, most DDR2 ECC registered memory I've
seen has a higher latency (both cycles AND wallclock) than the rather mature
DDR-400.

Q6600 (2.4 GHz quad, 4MB L2, single socket):
L2 latency  3MB  = 12ns
main memory 32MB = 80ns

Xeon 5310 (1.6GHz quad, 4MB L2, dual socket):
L2 latency           3MB =  15ns
main memory latency 32MB = 126.77ns

So basically it looks like AMD still has the lead in memory latency (although
I don't have the latest-greatest multi-socket Intel quads to compare).
Intel has a bigger transistor budget (with 2 pieces of silicon), yet AMD
looks to have the potential for better throughput with two 64-bit memory
busses per socket.  Definitely a good battle that's going to benefit the
end user, at least for the short term.

> No, on Opteron it doesn't. The *bandwidth* depends on nearness, the
> *latency* pretty much depends on the last snoop coming back from the
> farthest socket. 

I tried to prove that wrong by example, using numactl and related calls...
and failed.  I did notice in today's news that Asus is bragging about
a dual socket board for Barcelona that has a split power plane (faster
memory controller and L3 cache) and dual hypertransport connections between
the sockets.

> On systems with directory-based SMP protocols, things are different.
> That's probably what you're used to seeing -- SGI Origin, for example.

Indeed, a related code measuring bandwidth instead of latency produced this:

Of course my pstream code is embarrassingly parallel; each thread accesses a
local array and only communicates enough to make sure each stage of the
benchmark happens in sync.  Hardly a good example to show off the Altix.
