[Beowulf] What happened to AMD GPU?

C Bergström cbergstrom at pathscale.com
Wed Mar 4 14:29:40 PST 2015

On Thu, Mar 5, 2015 at 5:18 AM, Massimiliano Fatica <mfatica at gmail.com> wrote:
> I would not draw too many conclusions, the SpecAcc is just telling you the
> quality of the OpenACC compiler and the quality of the porting.
> For example, if you look at the results for CloverLeaf (I am familiar with
> this application and have other reference points), you have:
> AMD/Pathscale: 3.13 specaccel_peak
> NVIDIA/PGI:       3.45 specaccel_peak

To state it again - our compiler is not perfect. There are a couple of
things blocking us from hitting 4+ in certain benchmarks.

> Keeping the HW constant and changing the software (adding CUDA C and CUDA
> Fortran to the mix) will give you
> for the 3840x3840 grid the following average times per cell (measured in
> 10^-8s):
> OpenACC loops: 1.92
> OpenACC kernels: 1.78
> CUDA Fortran: 1.33
> CUDA C: 1.25

I would not compare PGI OpenACC to CUDA and draw a conclusion that
OpenACC is bound to lose. If we beat PGI OpenACC by 30% that
difference starts to narrow quickly.
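
As a rough worked example of that arithmetic (a sketch, not a
measurement): the snippet below takes the per-cell times quoted above
and applies a hypothetical 30% speedup to the OpenACC kernels number.
The helper names are made up for illustration, and "30% faster" is
assumed to mean that the time is divided by 1.3.

```python
# Average time per cell (units of 1e-8 s) as quoted above for the
# 3840x3840 CloverLeaf grid on a K20c.
times = {
    "OpenACC loops": 1.92,
    "OpenACC kernels": 1.78,
    "CUDA Fortran": 1.33,
    "CUDA C": 1.25,
}


def slowdown(t, baseline):
    """Fractional extra time of t relative to baseline (0.10 == 10% slower)."""
    return t / baseline - 1.0


def faster_by(t, fraction):
    """ASSUMPTION: 'X faster' means the time is divided by (1 + X)."""
    return t / (1.0 + fraction)


best_cuda = times["CUDA C"]

# Gap between the quoted OpenACC kernels result and CUDA C today.
gap_now = slowdown(times["OpenACC kernels"], best_cuda)

# Hypothetical: an OpenACC compiler 30% faster than the quoted result.
hypothetical = faster_by(times["OpenACC kernels"], 0.30)
gap_then = slowdown(hypothetical, best_cuda)

print(f"gap to CUDA C now:        {gap_now:.1%}")
print(f"gap to CUDA C if 30% up:  {gap_then:.1%}")
```

Under that assumption the gap to CUDA C drops from roughly 42% to
roughly 10%. This only rearranges the quoted numbers; it says nothing
about what any compiler actually achieves.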

> Timing is on a K20c, but we are interested in the relative performance. CUDA
> C/Fortran is 30% faster.
> There is also an OpenCL implementation of CloverLeaf but I don't have the
> results. It is probably in the same ballpark.
> This is a "simple" CFD code with regular access pattern, a directive-based
> porting gives you decent results.
> You could try to run the OpenCL code on the AMD card and see how far the
> Pathscale compiler is from it, but I am
> expecting something similar.
> OpenACC is an interesting option for people looking for high level
> programming, but you usually pay a penalty.
> How big the penalty is will depend on a lot of factors and it is very
> difficult to generalize.

I think with poorly written CUDA or poorly written OpenACC you'll pay
a penalty in both cases. I think with good OpenACC and a good compiler
(after we fix some bugs) - that perceived gap will start to
narrow. (Yes, highly tuned CUDA will probably always win, but by how
much?)

The thing to keep in mind is that in our compiler, unlike every other
implementation - we are not doing any source-to-source or dumping
1) Our code generator targets bare metal instructions
2) It's optimized for HPC - not just a recycled shader compiler

Our GPU transformations *know* the hardware and how to map the right
grid sizes to the resources underneath. Getting that mapping right,
combined with good old code generation == win.
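
To make "mapping grid sizes to resources" a bit more concrete, here is
a generic sketch of the kind of calculation involved. It is NOT how
our compiler does it. The block shape, the blocks-per-unit figure and
the compute-unit count are all illustrative assumptions, as are the
function names.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def launch_shape(nx, ny, block_x=32, block_y=8):
    """Blocks needed in x and y to cover an nx-by-ny domain.

    block_x and block_y are ASSUMED values, not tuned for any device.
    """
    return ceil_div(nx, block_x), ceil_div(ny, block_y)


def waves(grid, blocks_per_unit, compute_units):
    """Passes needed if only blocks_per_unit * compute_units blocks
    can be resident on the device at once (both ASSUMED inputs)."""
    total_blocks = grid[0] * grid[1]
    resident = blocks_per_unit * compute_units
    return ceil_div(total_blocks, resident)


# The 3840x3840 grid mentioned earlier in the thread.
grid = launch_shape(3840, 3840)
n_waves = waves(grid, blocks_per_unit=8, compute_units=13)

print("grid of blocks:", grid)
print("waves:", n_waves)
```

Changing the block shape changes both the number of blocks and how
evenly the last wave fills the device, which is the sort of trade-off
a hardware-aware transformation has to weigh.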
