[Beowulf] AMD performance (was 500GB systems)
bill at cse.ucdavis.edu
Sat Jan 12 18:21:27 PST 2013
On 01/12/2013 04:25 PM, Stu Midgley wrote:
> Until the Phi's came along, we were purchasing 1RU, 4 sockets nodes
> with 6276's and 256GB ram. On all our codes, we found the throughput
> to be greater than any equivalent density Sandy bridge systems
> (usually 2 x dual socket in 1RU) at about 10-15% less energy and
> about 1/3 the price for the actual CPU (save a couple thousand $$ per
We found much the same for many workloads. The last few generations of AMD
CPUs have had 4 memory channels per socket. At first I was puzzled that
even fairly memory intensive codes scaled well.
Even when following a random pointer chain, performance almost doubled
when I tested with 2 threads per memory channel instead of 1.
Then I realized the L3 latency is almost half of the latency to main
memory. So you get significant throughput advantages by having a queue
of L3 cache misses waiting for the instant any of the memory channels
become free.
In fact, even with 2 jobs per memory channel the memory channel sometimes
goes idle. Even 4 jobs per memory channel yields some further increase.
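
As a rough illustration of the queueing argument above (a toy model, not a
measurement), the sketch below runs exact mean value analysis on a closed
queue: each job spends some time finding out it missed L3, then waits for a
single memory channel. The 50 ns figures are placeholders chosen only so that
the L3 part is about half of the total latency, the function name is made up
for this example, and exponentially distributed service times are assumed.

```python
def channel_utilization(jobs, l3_ns=50.0, dram_ns=50.0):
    """Fraction of time one memory channel is busy with `jobs` jobs sharing it.

    Assumed model: each job spends l3_ns detecting an L3 miss (no contention),
    then queues for a channel that is busy dram_ns per request.
    Both latencies are placeholder values, not measured numbers.
    """
    queue = 0.0        # mean number of requests at the channel
    throughput = 0.0   # requests completed per ns
    for n in range(1, jobs + 1):
        # An arriving request waits behind the mean queue seen with n-1 jobs.
        response = dram_ns * (1.0 + queue)
        throughput = n / (l3_ns + response)
        queue = throughput * response
    return throughput * dram_ns


for jobs in (1, 2, 4):
    print(f"{jobs} job(s) per channel: {channel_utilization(jobs):.2f} busy")
# prints:
# 1 job(s) per channel: 0.50 busy
# 2 job(s) per channel: 0.80 busy
# 4 job(s) per channel: 0.98 busy
```

Under these assumptions the channel is still idle about 20% of the time with
2 jobs, and going to 4 jobs recovers most of the remainder.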
The good news is that most codes aren't as memory bandwidth/latency
intensive as the related micro benchmarks (and therefore scale better).
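
For reference, the kind of micro benchmark meant here, a random pointer
chase, can be sketched as follows. This is only an outline of the access
pattern: the names are hypothetical, and in Python the interpreter overhead
dominates the timing, so real latency numbers need the same loop in a
compiled language with one chain per thread.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list nxt where nxt[i] is the next slot to visit.

    Sattolo's algorithm produces a permutation that is one single cycle,
    so a chase from any start touches all n slots before repeating.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain; every load depends on the result of the previous one."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i


if __name__ == "__main__":
    n = 1 << 18  # arbitrary size for the sketch
    chain = build_chain(n)
    t0 = time.perf_counter()
    chase(chain, n)
    dt = time.perf_counter() - t0
    print(f"{dt / n * 1e9:.1f} ns per hop (interpreter overhead included)")
```

Because each hop depends on the previous one, a single chain can keep only
one miss in flight, which is why adding threads per memory channel helps.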
I think having more cores per memory channel is a key part of AMD's improved
throughput per socket when compared to Intel. That's not always true, of
course; again, it's highly application dependent.
> Of course, we are now purchasing Phi's. First 2 racks meant to turn
> up this week.
Interesting, please report back on anything of interest that you find.