[Beowulf] GPU Beowulf Clusters

Micha Feigin michf at post.tau.ac.il
Sun Jan 31 11:17:41 PST 2010


On Sun, 31 Jan 2010 21:15:12 +0300
"C. Bergström" <cbergstrom at pathscale.com> wrote:

> Micha Feigin wrote:
> > On Sat, 30 Jan 2010 17:30:31 -0800
> > Jon Forrest <jlforrest at berkeley.edu> wrote:
> >

[snip]

> > People are starting to work with OpenCL but I don't think that it's ready yet.
> > The NVidia implementation is still buggy and not up to par with CUDA in
> > terms of performance. Code is longer and more tedious (it mostly matches the
> > NVidia driver API model instead of the much easier to use C API). I know that
> > although NVidia say they fully support it, they don't like it too much.
> > NVidia techs told me that the performance difference can be about 1:2.
> >   
> That used to be true, but I thought they fixed that?  (How old is your
> information?)

From Thursday ... (three days or so ago). Not from personal experience though; I prefer CUDA.

I've got a friend who's working with Prof. Amnon Barak of the Hebrew University
of Jerusalem (the creator of MOSIX) to do something similar for the GPU, and
they are doing it with OpenCL.

One example: you can pass NULL as the work-group size and the implementation is
supposed to pick an optimal work-group size automatically. It turns out that
NVidia sets it to 1. Anyone who knows NVidia knows how good that is ...
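
For reference, the call in question is clEnqueueNDRangeKernel. A minimal sketch
of the NULL work-group-size case (the queue and kernel are hypothetical names,
assumed to have been created elsewhere):

    #include <CL/cl.h>

    cl_int launch(cl_command_queue queue, cl_kernel kernel, size_t n)
    {
        size_t global_work_size = n;

        /* Passing NULL for local_work_size asks the implementation to pick
           the work-group size itself -- the behaviour discussed above. */
        return clEnqueueNDRangeKernel(queue, kernel,
                                      1,                  /* work_dim */
                                      NULL,               /* global_work_offset */
                                      &global_work_size,  /* global_work_size */
                                      NULL,               /* local_work_size: runtime's choice */
                                      0, NULL, NULL);     /* no event wait list */
    }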

> > CUDA has existed for 5 years (and another 2 internally at NVidia). Version 1 of
> > OpenCL was released in December 2008 and they started working on 1.1 immediately
> > after that. It has also been broken almost from the start due to too many
> > companies controlling it (it's designed by a consortium) and trying to solve the
> > problem for too many scenarios at the same time.
> >   
> The problem isn't too many companies.. It was IBM's cell requirements 
> afaik.. Thank god that's dead now..

It's also Intel vs. NVidia vs. AMD.

> > ATI have also started supporting OpenCL but I don't have any experience with that.
> > Their upside is that it also allows compiling CPU versions.
> >
> > I would start with CUDA, as the move to OpenCL is very simple afterwards if you
> > wish, and CUDA is easier to start with.
> >   
> I would start with a directive based approach that's entirely more sane 
> than CUDA or OpenCL.. Especially if his code is primarily Fortran.  I 
> think writing C interfaces so that you can call the GPU is a maintenance 
> nightmare and will not only be time consuming, but will later make 
> optimizing the application *a lot* harder.  (I say this with my gpu 
> compiler hat on and more than happy to go into specifics)

My experience is that it will never be as good, but I'd be happy to hear from
personal experience by how much. I'm guessing that you are talking about HMPP
or something similar here.

Just moving things to the GPU yourself entails very little overhead and gives
you much more control over the memory and communication handling. For things
that need shared memory and/or textures for good performance, you usually need
the direct control anyway.
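
As an illustration of the kind of direct control I mean, here is a sketch of my
own (illustrative names, not from any tutorial) of an OpenCL kernel that stages
data into __local memory, the OpenCL equivalent of CUDA shared memory:

    /* Sketch: a 1D 3-point average that stages its inputs into __local
       (shared) memory so each global element is read once per work-group.
       Boundary handling at the ends of the array is omitted for brevity. */
    __kernel void blur3(__global const float *in,
                        __global float *out,
                        __local float *tile)   /* host allocates local_size + 2 floats */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        tile[lid + 1] = in[gid];               /* each work-item loads one element */
        if (lid == 0) {                        /* first item loads the two halo cells */
            tile[0]       = in[gid - 1];
            tile[lsz + 1] = in[gid + lsz];
        }
        barrier(CLK_LOCAL_MEM_FENCE);          /* wait until the whole tile is in */

        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }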

No experience with HMPP though; I should probably test-drive it at some point.

Personally I'd love to hear about specifics.

> > Also note that OpenCL gives you functional portability but not performance
> > portability. You will not write the same OpenCL code for NVidia, ATI, CPUs etc.
> > The vectorization should be all different (NVidia discourage vectorization, ATI
> > require vectorization, SSE requires different vectorization), the memory model
> > is different, the size of the work groups should be different, etc.
> >   
> Please look at HMPP and see if it may solve this..

Will do

[... snip again ...]

> >
> > The issue is not only computation complexity but also regular memory accesses.
> > Random memory accesses on the GPU can seriously kill your performance.
> >   
> I think I mentioned memory accesses.. Are you talking about page faults 
> or what specifically?  (My perspective is skewed and I may be using a 
> different term.) 

No, just random memory accesses; think lookup tables. LUTs are horrible
performance-wise on the GPU. If you can't get coalescing working for you, you
can take a performance hit of a factor of 8, IIRC.
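
To make the coalescing point concrete, a sketch of my own (illustrative names)
of the two access patterns in OpenCL C:

    /* Neighbouring work-items read neighbouring addresses: the hardware can
       coalesce these into a few wide memory transactions. */
    __kernel void streamed(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i];
    }

    /* Data-dependent gather through a lookup table: the addresses are
       scattered, coalescing breaks down, and each group of work-items can
       end up issuing many separate memory transactions. */
    __kernel void lut_lookup(__global const float *lut,
                             __global const int  *idx,
                             __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = lut[idx[i]];
    }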

[... snip once more ...]

> >
> > You don't have page faulting on the GPU; GPUs don't have virtual memory. If you
> > don't have enough memory, the allocation will just fail.
> >   
> Whatever you want to label it at a hardware level, nvidia cards *do* have 
> vram and the drivers *can* swap to system memory.  They use two things 
> to deal with this: a) a hw-based page fault mechanism and b) dma copying to 
> reduce cpu overhead.  If you try to allocate more than is available on 
> the card, yes, it will probably just fail.  (We are working on the 
> drivers)  My point was about what happens between the context switches 
> of kernels.

I'm not aware of intentional swapping done by NVidia. I've had issues with
kernels dying due to lack of memory on my laptop, which could have been solved
had there been swapping. And I'm talking about memory allocated from different
processes, where each process's allocation on its own does fit in memory.
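
For what it's worth, this is how a failed allocation surfaces through the
OpenCL host API today; a minimal sketch (the function name is mine), with no
swapping the error just comes back to the caller:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Sketch: without device-side swapping, an oversized allocation simply
       returns an error code.  Note that some implementations defer the real
       allocation until the buffer is first used, in which case the failure
       shows up later as CL_MEM_OBJECT_ALLOCATION_FAILURE / CL_OUT_OF_RESOURCES. */
    cl_mem try_alloc(cl_context ctx, size_t bytes)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

        if (err != CL_SUCCESS) {
            fprintf(stderr, "allocation of %lu bytes failed (error %d)\n",
                    (unsigned long)bytes, (int)err);
            return NULL;
        }
        return buf;
    }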

There is an issue with Windows Vista/7 vs. XP, where Windows Vista/7 decided to
manage the GPU memory as virtual memory, but again, I'm not sure about actual
swapping. I need to get updated on the exact details as I didn't test drive
Win 7 too much.

I should probably ask one of the devtechs at NVidia what is done at the driver
level. Pity I didn't see this thread last week, as there were a few of them
around for a visit :(

[ ...]

> >
> > My personal experience though is that it's much harder to use such optimization
> > on the CPU than on the GPU for most problems.
> >   
> CUDA/OpenCL and friends implicitly identify which areas can be 
> vectorized and then explicitly offload them.  You are comparing 
> apples/oranges here.. 
> 
> 

CUDA/OpenCL do it explicitly, actually. You have things like auto-vectorization
by the Intel compiler, but it's very limited in recognizing vectorizable code.
For anything big you need to vectorize manually. If you look at the OpenCL
tutorials from ATI, they tell you that you need to use float4 if you want the
CPU code to vectorize using SSE; it's not done implicitly.
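
To illustrate what the ATI tutorials mean (my own sketch, illustrative names):
the scalar and float4 versions of the same trivial kernel, where only the
latter maps naturally onto SSE registers on the CPU back end:

    /* One float per work-item: the CPU back end gets little to vectorize. */
    __kernel void scale_scalar(__global const float *in,
                               __global float *out, float a)
    {
        size_t i = get_global_id(0);
        out[i] = a * in[i];
    }

    /* Explicit float4: each work-item handles four floats, which maps
       directly onto a single SSE operation on the CPU. */
    __kernel void scale_float4(__global const float4 *in,
                               __global float4 *out, float a)
    {
        size_t i = get_global_id(0);
        out[i] = a * in[i];
    }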



