[Beowulf] ARM cpu's and development boards and research
Vincent Diepeveen
diep at xs4all.nl
Tue Nov 27 16:32:22 PST 2012
On Nov 28, 2012, at 12:17 AM, Prentice Bisbal wrote:
>
> On 11/27/2012 03:37 PM, Douglas Eadline wrote:
>>
>>> My interest in Arm has been the flip side of balancing flops to network
>>> bandwidth. A standard dual socket (AMD or Intel) can trivially saturate
>>> GigE. One option for improving the flops/network balance is to add
>>> network bandwidth with Infiniband. Another is a slower, cheaper, cooler
>>> CPU and GigE.
>>>
>> applause.
>
> I applaud that applause.
>
> What Bill has just described is known as an "Amdahl-balanced system",
> and is the design philosophy behind the IBM Blue Genes and also
> SiCortex. In my opinion, this is the future of HPC. Use lower power,
> slower processors, and then try to improve network performance to reduce
> the cost of scaling out. Essentially, you want the processors to be
> *just* fast enough to keep ahead of the networking and memory, but no
> faster, to optimize energy savings.

For HPC the winning concept seems to be increasing the core count on
manycore chips. We also see that Blue Gene couldn't keep to its concept:
it has, what is it, 18+ cores now or so? So the manycores have won the
battle in HPC big time, for codes that can be vectorized.

If we look at ARM, for example, constructing a huge supercomputer with it
is already impossible from a production viewpoint: just the combined size
of the L1 and L2 caches makes it too expensive to produce at a competitive
price versus a single huge manycore chip.

Suppose you had quad-core ARMs compete with an Nvidia K20X. The K20X has
1.3 Tflop, that's 1300 Gflop. The quad-core ARMs run at 1 GHz as well. Each
CPU has an L2 cache of up to 1 MB and a 32 KB + 32 KB L1 cache. Counting
1 flop per cycle per core, a quad-core ARM delivers 4 Gflop, so you need
1300 / 4 = 325 of them. If one had to produce 325 ARMs, we're speaking of
in total 64 KB * 325 = 20.8 MB worth of L1 caches and 325 MB worth of L2
caches. Producing that K20X costs a fraction of that.

Those 64-bit ARMs are going to eat around 6 watts at full load.
6 watts * 325 CPUs = 1950 watts, and with 325 chips you need cooling
everywhere.

Now this is an extreme example, but it shows you clearly that for HPC
small processors can never compete with giant manycores from a production
price viewpoint.
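The arithmetic above can be redone as a small Python sketch. It only
uses the round numbers from this post; the 1 flop per cycle per core
figure is an assumption, chosen because it is what makes the 325-chip
count come out, and the function name is made up for illustration.

```python
# Back-of-the-envelope version of the ARM-vs-GPU comparison.
# Every constant is a round number from the post, not measured data.

GPU_GFLOPS = 1300            # GPU peak as quoted in the post
ARM_CORES_PER_CHIP = 4       # quad-core ARM
ARM_GHZ = 1.0                # clock as quoted in the post
FLOPS_PER_CYCLE = 1          # ASSUMPTION: implied by the 325-chip figure
L1_KB_PER_CHIP = 32 + 32     # L1 counted per chip, as the post does
L2_MB_PER_CHIP = 1           # "up to 1 MB" L2
WATTS_PER_CHIP = 6           # full-load estimate from the post


def arm_equivalent(gpu_gflops=GPU_GFLOPS):
    """Return chip count, total cache and total power needed to match the GPU."""
    chip_gflops = ARM_CORES_PER_CHIP * ARM_GHZ * FLOPS_PER_CYCLE
    chips = gpu_gflops / chip_gflops
    return {
        "chips": chips,
        "l1_mb": chips * L1_KB_PER_CHIP / 1000,  # decimal MB, as in the post
        "l2_mb": chips * L2_MB_PER_CHIP,
        "watts": chips * WATTS_PER_CHIP,
    }


result = arm_equivalent()
for key, value in result.items():
    print(f"{key}: {value:g}")
```

Changing FLOPS_PER_CYCLE or WATTS_PER_CHIP shows how sensitive the
conclusion is to those two guesses.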

I don't know what price IBM delivers Blue Gene for, but I'm sure the
next-generation Blue Gene CPU will have more cores than the current one.
We can be sure IBM will find a solution there that they can offer to
their clients at a competitive price, yet it will also mean an ever
bigger chip in order to compete against what Nvidia offers there now.

As for Intel: they haven't made any statements on whether Xeon Phi has
cache coherency. Yet I assume they will need to drop that or make it
manual.

As for networking, I also disagree there. Take a calculation that uses
an FFT, be it an approximation that could have backtracking errors, as
with a floating-point FFT, or a lossless calculation using an NTT. In
both cases the network needs of these algorithms are a fraction of the
CPU power they use.

So if you have a manycore with enough RAM on it, you can to a large
extent do the calculation in there without communicating. Only after a
bunch of iterations do you need to communicate with other parts of the
network. So the communication in that sense needs O(log n) of what you
actually calculate. Now the last phases of the algorithm are a tad more
tricky to do in that manner, but it's the same principle.
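One way to picture the "mostly local, communicate now and then" point is
a radix-2 FFT of n points spread in equal blocks over p nodes. In that
textbook layout (often called binary exchange) only log2(p) of the
log2(n) butterfly stages cross node boundaries. The sketch below counts
those stages. It is an illustrative model under that assumption, not
necessarily the scheme meant above, and the function names are made up.

```python
def fft_stage_split(n, p):
    """Return (local_stages, remote_stages) for a radix-2 FFT of n points
    block-distributed over p nodes. Both n and p must be powers of two."""
    if n < 1 or p < 1 or n & (n - 1) or p & (p - 1) or p > n:
        raise ValueError("n and p must be powers of two with p <= n")
    total = n.bit_length() - 1      # log2(n) butterfly stages overall
    remote = p.bit_length() - 1     # log2(p) stages pair points on two nodes
    return total - remote, remote


def per_node_traffic(n, p):
    """Return (point_updates, points_sent) per node in this simple model."""
    local, remote = fft_stage_split(n, p)
    points = n // p                        # points held by each node
    updates = points * (local + remote)    # every point is updated each stage
    sent = points * remote                 # one block exchange per remote stage
    return updates, sent


local, remote = fft_stage_split(2**30, 1024)
updates, sent = per_node_traffic(2**30, 1024)
print(f"local stages: {local}, remote stages: {remote}")
print(f"point updates per node: {updates}, points sent per node: {sent}")
```

The more RAM a node has, the larger n / p can be, and the larger the
share of stages that never touches the network.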

Even then the bandwidth needed is still massive, not because it isn't
O(log n), but simply because a card that delivers 1+ Tflop is that much
by today's standards. The biggest bandwidth you can achieve over 1 FDR
network card is just a fraction of the bandwidth that 1 Tflop
represents.

If we speak about FMAs (fused multiply-adds), we speak about 666 billion
FMAs a second. Each FMA reads 3 doubles and writes 1. That is a total of
4 doubles times 8 bytes times 0.666 trillion FMAs a second
= 32 * 0.666 = 21+ TB/s of internal bandwidth that such a manycore is
handling.
FDR InfiniBand obviously delivers just a fraction of that, so the
reality already is that the network delivers only a small fraction of
what the GPU can handle.
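To put a number on "a fraction", here is the same calculation in Python,
set against one FDR link. The 56 Gbit/s figure is the nominal 4x FDR
signalling rate and is an assumption added here; real payload bandwidth
is somewhat lower, so the gap is if anything larger.

```python
# Internal operand traffic of the manycore versus one FDR InfiniBand link.

FMAS_PER_SECOND = 666e9   # from the post
DOUBLES_PER_FMA = 4       # 3 reads + 1 write, as in the post
BYTES_PER_DOUBLE = 8
FDR_GBIT_PER_S = 56       # ASSUMPTION: nominal 4x FDR signalling rate

internal_bytes_per_s = FMAS_PER_SECOND * DOUBLES_PER_FMA * BYTES_PER_DOUBLE
fdr_bytes_per_s = FDR_GBIT_PER_S * 1e9 / 8
ratio = internal_bytes_per_s / fdr_bytes_per_s

print(f"internal: {internal_bytes_per_s / 1e12:.1f} TB/s")
print(f"FDR link: {fdr_bytes_per_s / 1e9:.1f} GB/s")
print(f"ratio:    about {ratio:.0f} to 1")
```

Under these assumptions the link carries on the order of one
three-thousandth of the operand traffic inside the chip.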
>
> The Blue Genes do this incredibly well, so did SiCortex, and Seamicro
> appears to be doing this really well, too, based on all the press
> they've been getting. With the DARPA Exascale report saying we can't get
> to Exascale with current power consumption profiles, you can bet this
> will be a hot area of research over the next few years.

They'll find a solution, I'm sure of it, yet it will involve massive
amounts of cores on a single CPU. Whether it's 3-dimensional, or just the
next generation of transistors, or some new technology, we'll see that
soon :)
>
> Okay. I'm done listening to myself type.
>
> Prentice