[Beowulf] Multicore Is Bad News For Supercomputers

Fri Dec 5 18:32:24 PST 2008

Mark Hahn wrote:
>> (Well, duh).
> 
> yeah - the point seems to be that we (still) need to scale memory
> along with core count.

Which seems to be happening.  Suddenly designers can get more real-world
performance by adding bandwidth.  This isn't new in the GPU world, of course,
where ATI and Nvidia have been selling devices for $250-$600 with 70-140GB/sec.

This is, however, rather new for CPUs.  Intel's been dominating the market with
sub-10GB/sec memory systems for some time now, while AMD has had > 10GB/sec
for, er, 3 generations now to little effect.

So the problems that on older machines, with fewer cores, were more sensitive
to latency (and the resulting nasty laws of physics) are transforming into
bandwidth-limited problems that are very friendly to multicore.

So now Intel's shipping a CPU that can run 8 threads and suddenly has 2-3
times the memory bandwidth.  Suddenly Intel's gone from trailing AMD by a
factor of 2 or more to matching AMD dual sockets with a single socket.

AMD dual socket shanghai:
Number of Threads requested = 8
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       21638.5276       0.0370       0.0370       0.0371
Scale:      21605.3675       0.0371       0.0370       0.0371
Add:        21451.1315       0.0560       0.0559       0.0562
Triad:      21399.5102       0.0562       0.0561       0.0563
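
As a sanity check on those numbers, here is a rough sketch of the byte
accounting behind them.  It assumes STREAM's usual conventions (8-byte
doubles, Copy/Scale counted as 2 arrays touched per element, Add/Triad as 3,
1 MB = 10**6 bytes, rate taken from the min time); the function name is mine,
not part of STREAM.

```python
# Bytes moved per array element by each STREAM kernel, assuming 8-byte doubles:
# Copy and Scale read one array and write one; Add and Triad read two, write one.
BYTES_PER_ELEMENT = {"Copy": 16, "Scale": 16, "Add": 24, "Triad": 24}

# (rate in MB/s, min time in seconds), copied from the Shanghai output above.
RESULTS = {
    "Copy": (21638.5276, 0.0370),
    "Scale": (21605.3675, 0.0370),
    "Add": (21451.1315, 0.0559),
    "Triad": (21399.5102, 0.0561),
}


def implied_elements(kernel, rate_mb_s, min_time_s):
    """Array length implied by a reported rate (hypothetical helper)."""
    bytes_moved = rate_mb_s * 1e6 * min_time_s
    return bytes_moved / BYTES_PER_ELEMENT[kernel]


for kernel, (rate, min_time) in RESULTS.items():
    n = implied_elements(kernel, rate, min_time)
    print(f"{kernel:6s} implies about {n / 1e6:.1f} M elements per array")
```

All four kernels should imply about the same array length, which is a quick
way to confirm that the rates and times in a pasted run belong together.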

techreport.com reports 21GB/sec on the Sandra memory bandwidth test with a
Core i7 and 3 x 1333 MHz memory.  If anyone has a Core i7 around I'd be
interested in the STREAM numbers.

> not just memory bandwidth but also concurrency

Indeed, so now AMD dual sockets have 4 memory systems, Intel single sockets
have 3.  I'm not familiar with the ATI/Nvidia details, but I assume that to make
use of 100-140GB/sec memory systems they must have a high degree of parallelism.

AMD dual socket shanghai:
min threads=1 max threads=8 pagesize=4096 cacheline=64
Each threads will access a 262144 KB array 20 times

1 thread(s), a random cacheline per 73.31 ns, 73.31 ns per thread
2 thread(s), a random cacheline per 37.45 ns, 74.90 ns per thread.
4 thread(s), a random cacheline per 19.28 ns, 77.11 ns per thread.
8 thread(s), a random cacheline per 9.84 ns, 78.74 ns per thread.
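
For anyone who wants to play with those figures, a back-of-the-envelope
sketch follows.  The per-thread column is just threads times the aggregate
interval, and 64 bytes per interval gives the random-access bandwidth.  The
last function applies Little's law to guess how many cachelines must be in
flight to sustain a given bandwidth; plugging in the unloaded 73 ns latency
there is my assumption, so treat the result as a rough lower bound.  All the
names are made up for illustration.

```python
CACHELINE_BYTES = 64  # from the benchmark header above

# threads -> aggregate ns per random cacheline, copied from the output above
AGGREGATE_NS = {1: 73.31, 2: 37.45, 4: 19.28, 8: 9.84}


def per_thread_ns(threads, aggregate_ns):
    """Latency seen by each thread if all threads fetch concurrently."""
    return threads * aggregate_ns


def random_bandwidth_gb_s(aggregate_ns):
    """Random-access bandwidth; bytes per ns equals GB/s (10**9 bytes/s)."""
    return CACHELINE_BYTES / aggregate_ns


def lines_in_flight(bandwidth_gb_s, latency_ns):
    """Little's law: concurrency = throughput * latency, in cachelines."""
    return bandwidth_gb_s * latency_ns / CACHELINE_BYTES


for threads, ns in AGGREGATE_NS.items():
    print(f"{threads} thread(s): {per_thread_ns(threads, ns):.2f} ns per thread, "
          f"{random_bandwidth_gb_s(ns):.2f} GB/s random")

# Roughly the STREAM rate reported earlier, at the single-thread latency.
print(f"{lines_in_flight(21.4, 73.31):.1f} cachelines in flight for 21.4 GB/s")
```

The gap between the random-access figure and the STREAM rate is the
concurrency point: 8 threads with one miss outstanding each don't come close
to covering the latency.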

> (number of banks), though "ieee spectrum online for tech insiders"
> doesn't get into that kind of depth :(
> 
> I still usually explain this as "traditional (ie Cray) supercomputing
> requires a balanced system."  commodity processors are always less balanced
> than ideal, but to varying degrees. 

If you ignore multicore, bandwidth and the effective use of bandwidth (read
that as application performance) are going up.  Who cares if the "unbalanced"
machines are running at 5% of peak, as long as HPC application performance
(more closely tied to bandwidth) keeps increasing?

> intel dual-socket quad-core was
> probably the worst for a long time, but things are looking up as intel
> joins AMD with memory connected to each socket.

Indeed, so maybe bandwidth will become more of a design constraint.  Possibly
a fixed amount of memory per CPU, surface-mounted memory, and memory buses
wider than is practical with the traditional socket with 4-6 DIMMs a few
inches away.... till it's feasible to put RAM and CPU on the same die
anyway... IRAM here we come.  In the meantime maybe motherboards will start
looking more like video cards.  So maybe something like:
* 32-64 cores per socket, less than 5 GHz
* 4 GB of high-speed RAM ( > 150GB/sec) per socket
* multiple HyperTransport-like connections to slower memory
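
To put that wish list next to today's numbers, here is a quick per-core
bandwidth comparison.  It is only a sketch: it treats 150GB/sec as an exact
figure rather than a lower bound, uses roughly the dual-socket Shanghai
STREAM rate from above spread over its 8 threads, and the helper name is
made up.

```python
def gb_s_per_core(bandwidth_gb_s, cores):
    """Memory bandwidth available to each core if shared evenly."""
    return bandwidth_gb_s / cores


# Dual-socket Shanghai: about 21.4 GB/s STREAM shared by 8 threads.
shanghai = gb_s_per_core(21.4, 8)

# Hypothetical socket from the list above: 150 GB/s over 32 to 64 cores.
wish_32 = gb_s_per_core(150.0, 32)
wish_64 = gb_s_per_core(150.0, 64)

print(f"Shanghai today:   {shanghai:.2f} GB/s per core")
print(f"150 GB/s, 32 cores: {wish_32:.2f} GB/s per core")
print(f"150 GB/s, 64 cores: {wish_64:.2f} GB/s per core")
```

So at 64 cores the hypothetical part only holds per-core bandwidth about
where it is now; the gain over today's balance comes at the 32-core end or
from going past 150GB/sec.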