[Beowulf] Is there really a need for Exascale?

Fri Nov 30 12:41:52 PST 2012

On 11/30/2012 03:25 PM, Eugen Leitl wrote:
> Absolutely. CUDA is a lot like assembler that way, and assembler
> has been almost completely displaced by low-level but hardware-independent
> languages like C.
>
> You can't tune as much in OpenCL, but on the other hand, you
> don't have to. The achievable performance is lower, but more
> uniform across diverse platforms. The JIT knows the hardware,
> so that you don't have to.

I wish that were true.  How to decompose work between threads/blocks
(items/groups in OpenCL), how to balance the speed of shared (local)
memory against the resulting reduction in occupancy, etc. are all
hardware-dependent things that the JIT can't and doesn't hide from you.
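
As a rough illustration of that tradeoff, here is a minimal sketch of an
occupancy estimate.  Everything in it is an assumption made for the sake
of the example: the Device class, the occupancy() function, and the
per-multiprocessor limits for "device_a" and "device_b" are hypothetical
placeholders rather than vendor figures, and the model ignores register
pressure and allocation granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Per-multiprocessor limits; the values used below are illustrative."""
    name: str
    max_threads: int   # resident threads per multiprocessor
    max_blocks: int    # resident blocks (work-groups) per multiprocessor
    shared_bytes: int  # shared (local) memory per multiprocessor


def occupancy(dev, threads_per_block, shared_per_block):
    """Fraction of thread slots filled, in a simplified model that
    ignores register pressure and allocation granularity."""
    # Blocks that fit when only the thread limit is considered.
    by_threads = dev.max_threads // threads_per_block
    # Blocks that fit when only shared memory is considered.
    if shared_per_block > 0:
        by_shared = dev.shared_bytes // shared_per_block
    else:
        by_shared = dev.max_blocks
    # The tightest limit decides how many blocks are resident.
    blocks = min(dev.max_blocks, by_threads, by_shared)
    return blocks * threads_per_block / dev.max_threads


# Hypothetical devices: assumed limits, not taken from any datasheet.
device_a = Device("device_a", 1536, 8, 48 * 1024)
device_b = Device("device_b", 2048, 16, 48 * 1024)

for dev in (device_a, device_b):
    for threads, shared in ((64, 0), (256, 0), (256, 16 * 1024)):
        occ = occupancy(dev, threads, shared)
        print(dev.name, threads, shared, round(occ, 3))
```

Under these assumed limits, the same launch configuration (256 threads
and 16 KiB of shared memory per block) fills half the thread slots on
device_a but only three eighths on device_b, and 64-thread blocks fill
a third on one and a half on the other.  The block size and the
shared-memory budget have to be retuned per device, and that is the
part the JIT leaves to you.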

To extend your assembly analogy, OpenCL tried to be a perfectly general 
assembly language that one could write in to target AMD GPUs *and* CUDA 
GPUs *and* Intel/AMD multicore processors, *and* IBM Cells, etc.   That 
was always going to end badly.   Not only do you not get performance 
portability - a multicore processor is not very much like a GPU, JIT or 
no JIT - but the generality means that "hello world" in OpenCL is 100+ 
lines longer than in CUDA.  And that's why almost no one bothers
teaching OpenCL.  (Check out the difference in counts between "Intro to
CUDA" and "Intro to OpenCL".)  I'm all for open standards, but they have to
standardize something that makes sense.

The cycle of programming for performance in new hardware is always that 
the enthusiastic early adopters have to program in the hardware-specific 
low-level stuff for a while, and eventually compilers and/or new
programming models catch up.  I'm hoping that OpenACC is the start of 
that second stage.  For *really* good performance on tricky problems 
users will still have to fall back to CUDA or something else for AMD 
GPUs; but then a number of users here and even community codes (e.g.,
GROMACS) still have hand-coded assembly for a few architectures to make
sure the right bits of their kernels get vectorized properly, etc.

   - Jonathan
-- 
Jonathan Dursi <ljdursi at scinet.utoronto.ca> SciNet;Compute/Calcul Canada