[Beowulf] Processors that can do 20+ GFLOPS/Watt

Thu Oct 4 13:05:51 PDT 2012

http://www.streamcomputing.eu/blog/2012-08-27/processors-that-can-do-20-gflops-watt/

Processors that can do 20+ GFLOPS/Watt

by Vincent Hindriksen

    August 27, 2012

For yearly power usage there is a rule of thumb: a device that is continuously on costs about 1.5 times its wattage in Euro per year. So the computer in front of me, which draws around 107 Watt, would cost me about €160 a year if I left it on. A moderate cluster with several GPUs of a few hundred Watts each would cost a few thousand Euros a year. I would say: very doable for most companies.
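
The rule of thumb can be written down as a small sketch. The factor 1.5 and the 107 Watt figure come from the text above; the four 250 Watt GPUs and the function names are assumptions chosen only for illustration. The second function works the factor backwards into the electricity price per kWh that it implies.

```python
# Rule of thumb from the text: yearly cost in Euro ~= 1.5 x continuous power draw in Watt.
EURO_PER_WATT_YEAR = 1.5  # factor quoted in the article
HOURS_PER_YEAR = 24 * 365


def yearly_cost_euro(watts: float) -> float:
    """Estimated yearly cost of a device that is switched on all year."""
    return watts * EURO_PER_WATT_YEAR


def implied_price_per_kwh(factor: float = EURO_PER_WATT_YEAR) -> float:
    """Electricity price (Euro/kWh) that the rule-of-thumb factor corresponds to."""
    kwh_per_watt_year = HOURS_PER_YEAR / 1000.0  # 1 W drawn for a year, in kWh
    return factor / kwh_per_watt_year


if __name__ == "__main__":
    print(f"107 W desktop: ~{yearly_cost_euro(107):.1f} Euro/year")
    # Hypothetical cluster: 4 GPUs of 250 W each (an assumption, not from the article).
    print(f"4 x 250 W GPUs: ~{yearly_cost_euro(4 * 250):.0f} Euro/year")
    print(f"Implied tariff: ~{implied_price_per_kwh():.3f} Euro/kWh")
```

The implied tariff is a quick sanity check: if your own electricity price differs a lot from it, scale the factor accordingly.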

So why does performance per Watt matter? There is more to a Watt than just its cost. The energy needed to cool a cluster is quite high, as most of the energy consumed ends up as heat. Then there is the growing demand for portable power. And in case you are thinking of swiping your credit card for a top-10 supercomputer: at that scale the energy costs are extremely high.

In this article I try to get an overview of who is entering the 20+ GFLOPS/Watt area. All processors that do less than 20 GFLOPS/Watt need to have other advantages to survive. And you’ll see that all the green processors are programmed with OpenCL, the technology StreamComputing is all about.
The list

Understand that since I mix CPUs, GPUs and SoCs (= CPU+GPU), the list is really only an indication of what is possible. Also, a computer consists of more energy-consuming parts than just the processors: interconnects, memory, hard drives, etc.

Disclaimer: The list below is incomplete and based on theoretical values. TDP is assumed to be consumed when the processor is working at maximum performance. Actual GFLOPS/Watt values can be much lower, depending on many factors. If you want to buy hardware specifically for the highest GFLOPS/Watt, have your software tested on the device.
Processor	Type	GFLOPS (32bit)	GFLOPS (64bit)	Watt (TDP)	GFLOPS/Watt (32bit)	GFLOPS/Watt (64bit)
Adapteva Epiphany-IV	Epiphany	100	N/A	2	50	N/A
Movidius Myriad	ARM SoC: LEON3+SHAVE	15.28	N/A	0.32	48	N/A
ZiiLabs	ARM SoC	58	N/A	?	20?	N/A
Nvidia Tesla K10	X86 GPU	4577	190	225	20.34	0.84
ARM + NEON T604	ARM SoC	8 + 68	N/A	4?	19?	N/A
NVidia GTX 690	X86 GPU x 2	5621	234?	300	18.74	0.78
GeForce GTX 680	X86 GPU	3090	128	195	15.85	0.65
AMD Radeon HD 7970 GHz	X86 GPU	4300	1075	300+	14.3	3.58
Intel Knight's Corner (Xeon Phi)	X87?	2000?	1000	200?	10?	5?
AMD A10-5800K + HD 7660D	X86 SoC	121 + 614	?	100	7.35	?
Intel Core i7-3770 + HD4000	X86 SoC	225 + 294.4	112 + 73.6	77	6.74	2.41
IBM Power A2	Power CPU	204?	204	55	3.72?	3.72
Intel Core i7-3770	X86 CPU	225	112	?	?	?
AMD A10-5800K	X86 CPU	121	60?	?	?	?
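
The last two columns are simply peak GFLOPS divided by TDP. As a minimal sketch, the following recomputes them for a few rows whose inputs carry no question marks in the table; the function and variable names are illustrative only, and the numbers are the theoretical peaks listed above, not measurements.

```python
# (name, fp32 GFLOPS, fp64 GFLOPS or None if N/A, TDP in Watt), copied from the table.
# Rows with "?" or "300+" in the input columns are left out on purpose.
ROWS = [
    ("Adapteva Epiphany-IV", 100.0, None, 2.0),
    ("Movidius Myriad", 15.28, None, 0.32),
    ("Nvidia Tesla K10", 4577.0, 190.0, 225.0),
    ("GeForce GTX 680", 3090.0, 128.0, 195.0),
    ("Intel Core i7-3770 + HD4000", 225.0 + 294.4, 112.0 + 73.6, 77.0),
]


def gflops_per_watt(gflops, tdp_watt):
    """Theoretical efficiency; returns None when the precision is not supported."""
    if gflops is None:
        return None
    return gflops / tdp_watt


def rank(rows):
    """Sort rows by 32-bit GFLOPS/Watt, best first."""
    scored = [
        (name, gflops_per_watt(fp32, tdp), gflops_per_watt(fp64, tdp))
        for name, fp32, fp64, tdp in rows
    ]
    return sorted(scored, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, eff32, eff64 in rank(ROWS):
        eff64_text = "N/A" if eff64 is None else f"{eff64:.2f}"
        print(f"{name:30s} {eff32:6.2f} GFLOPS/W (32bit)  {eff64_text} GFLOPS/W (64bit)")
```

Because TDP is used as the divisor, the results inherit the caveat from the disclaimer: real power draw under load can differ from TDP.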

The list contains recent and generally available processors, but I will add any processor you want to see in the list – just request it in a comment.

Please also point me to sources where official data can be found on these processors, as it seems to be top-secret data. As not all the data was available, I had to make some guesses.
CPU vs GPU

Let’s be clear:

    A GPU needs a CPU as a host.
    A GPU is great at vector computations, a CPU is much better at scalar computations.

In other words, a mix of a scalar and a vector processor is best. But once a problem can be defined as a vector problem, the GPU is much, much faster than a CPU.
64 bit vs 32 bit

As 64-bit values take twice the memory of 32-bit values, moving them costs more energy and only half as many numbers arrive at the processor per transfer: two reasons why more energy is consumed. Due to architecture differences, CPUs have a penalty for 32 bit and GPUs a penalty for 64 bit.

Notice that most X86 alternatives have no 64-bit support, or only recently started with it. GPUs crunch double-precision numbers at a fourth or less of their 32-bit peak performance.
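
To put numbers on "a fourth or less", here is a small sketch that divides the 64-bit peak by the 32-bit peak for some rows of the table above. The peaks are the theoretical values from the table; the dictionary and function names are made up for this example.

```python
# name -> (fp32 peak GFLOPS, fp64 peak GFLOPS), theoretical values from the table.
GPU_PEAKS = {
    "Nvidia Tesla K10": (4577.0, 190.0),
    "GeForce GTX 680": (3090.0, 128.0),
    "AMD Radeon HD 7970 GHz": (4300.0, 1075.0),
}
CPU_PEAKS = {
    "Intel Core i7-3770 (CPU only)": (225.0, 112.0),
}


def fp64_fraction(fp32_gflops: float, fp64_gflops: float) -> float:
    """Share of the 32-bit peak that remains when computing in 64 bit."""
    return fp64_gflops / fp32_gflops


if __name__ == "__main__":
    for name, (fp32, fp64) in {**GPU_PEAKS, **CPU_PEAKS}.items():
        print(f"{name:32s} fp64 runs at {fp64_fraction(fp32, fp64):.1%} of fp32 peak")
```

With these table values the GPUs land at or below one fourth, while the CPU keeps about half of its 32-bit peak.
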
Architectures

ARM, X86/X87, Power and Epiphany all make different architecture choices to reach their targeted trade-off between precision, power consumption and performance optimisation (control unit). These choices sometimes make it impossible to keep pace with other architectures in a certain direction.
Current winner: Adapteva Epiphany

Their 64-core Epiphany-IV is programmable with OpenCL, and its 50 GFLOPS/Watt makes it worth investing time in porting software if you need a portable device. People who have already ported their software to OpenCL have an advantage here. Adapteva even claims 72 GFLOPS/Watt, as you can read here. With a 100-core CPU coming up, they will probably raise the bar even further.

X86 CPUs have the advantage of precision and legacy code, of which precision is the biggest advantage. As X86 GPUs (with Nvidia on top) are entering the 20+ GFLOPS/Watt range, this could be very interesting for defending the X86 market against ARM.

ARM processors have a lot of software written for them (via Android) and are very flexible in design, while keeping power usage for the CPU part around 1 Watt. For instance, ZiiLabs’ processor can be compared to the design of Adapteva, but with an ARM CPU attached to it.
Conclusion

There is much more to consider than just the number of GFLOPS/Watt, and one can only speculate on which architecture will be mainstream in a few years. Luckily, recompiling for other architectures is getting easier with compiler technologies such as LLVM, so we don’t need to worry too much. Except for redesigning our software for multi-core, of course. You have read above that the new architectures are programmed with OpenCL. It is better to invest in this technology now than later.
More reading

As memory-access takes energy, minimising memory-calls can lower consumption. This article on the ARM blog explains how this is done with MALI GPUs.

The Mont Blanc project is a supercomputer based on ARM. This 12 page PDF shows some numbers and specifications of this supercomputer.

As supercomputers eat lots of power, The Green 500 tries to encourage building greener HPC systems.

     david moloney

    If you were at HotChips 2011 you would have seen Movidius Myriad which delivers 50GFLOPS/W
        streamcomputing

        Ok, added to the list – I will update the text later. How many GFLOPS does it deliver?
    david moloney

    Also Epiphany is shown as ARM based in your table which I’m sure must be a mistake.
        streamcomputing

        Oops, that’s not ARM at all! Thanks for noticing!
            pip010

            64 RISC units, but not mentioning what. good chance it is ARM!
    MySchizoBuddy

    Can you include Tilera chips on the list?
    http://www.tilera.com/

    I don’t know how many GFlops it delivers
        streamcomputing

        True, not mentioned. I only found http://www.tgdaily.com/business-and-law-features/39408-tilera-goes-pro-with-tilepro64 from 2007 – they do not provide any information on actual performance/Watt anywhere. Or very hidden.
    PENG ZHAO

    How about Nvidia Tesla K10 and Geforce GTX 690? I found some figures.
    Tesla K10:
    Power: 225 W
    Single float: 4577 Gigaflops, 20.342 GFlops/W
    Double float: 190 Gigaflops, 0.8444 GFlops/W

    GTX 690:
    Power: 300 W
    Single float: 5621Gigaflops, 18.74 GFlops/W
    Double float: ?

    The single float computation power is impressive, but the double float one is rubbish.
    Even worse, Nvidia seems to have stopped updating their OpenCL implementation.
        streamcomputing

        The GTX 690 is a double GPU, so therefore I chose to put the 680 in the list – maybe good point to add double-GPU cards too.

        It seems that my source for the K20 was completely wrong. I’ll update for the K10 for now.
    E P

    The table says FLOPS/Watt, instead of GFLOPS/Watt.

    Can you, please, include Integer arithmetic?

    Depending on floating point (especially 32-bit) is sometimes not an option due to accumulation of errors. So, a lot of integer arithmetic algorithms have been developed. Main point in porting them to OpenCL will be keeping the integer arithmetic calculations. And, that becomes even more important having in mind that a lot of devices increase performance and/or decrease power consumption at the expense of accuracy.
        streamcomputing

        It was extremely difficult (and exhausting) to find the data already in the list. I will therefore focus on what is already there and try to complete the list for just 32-bit and 64-bit (be it floats or integers).

        The trade-off between precision and the other aspects of computing is an interesting subject though.

    http://twitter.com/codedivine rahul garg

    Corrections:

    1. The 3770K’s peak (CPU-only) is about 225 GFlops (at base frequency, with turbo slightly higher).

    2. Knight’s corner has fp32 at twice the rate of fp64. So I expect 2 teraflop for fp32 for knights corner.

    3. 3770K CPU-only fp64 peak is half of fp32 peak = 112 gflops.
    http://twitter.com/codedivine rahul garg

    Another correction: HD 4000 on 3770K has fp64 peak of 73.6 gflops.
        streamcomputing

        Thanks for all the feedback! You’re great! Together we can complete the picture.
    http://twitter.com/daphreak Stuart

    http://www.kalray.eu/en/technology/mppa-256.html seems to be a good performer but no OpenCL support.