[Beowulf] GPU Beowulf Clusters

Sun Jan 31 09:33:58 PST 2010

On Sat, 30 Jan 2010 17:30:31 -0800
Jon Forrest <jlforrest at berkeley.edu> wrote:

> On 1/30/2010 2:52 PM, "C. Bergström" wrote:
> 
> > Hi Jon,
> >
> > I must emphasize what David Mathog said about the importance of the gpu
> > programming model.
> 
> I don't doubt this at all. Fortunately, we have lots
> of very smart people here at UC Berkeley. I have
> the utmost confidence that they will figure this
> stuff out. My job is to purchase and configure the
> cluster.
> 
> > My perspective (with hopefully not too much opinion added)
> > OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks in features, more
> > tedious to write and in an effort to stay generic loses the potential to
> > fully exploit the gpu. At one point the performance of the drivers from
> > Nvidia was not equivalent, but I think that's been fixed. (This does not
> > mean all vendors are unilaterally doing a good job)
> 
> This is very interesting news. As far as I know, nobody is doing
> anything with OpenCL in the College of Chemistry around here.
> On the other hand, we've been following all the press about how
> it's going to be the great unifier so that it won't be necessary
> to use a proprietary API such as CUDA anymore. At this point it's too
> early to do anything with OpenCL until our colleagues in
> the Computer Science department have made a pass at it and
> have experiences to talk about.
> 

People are starting to work with OpenCL, but I don't think it's ready yet.
The NVidia implementation is still buggy and not on par with Cuda in terms of
performance. The code is longer and more tedious to write (it mostly matches
the NVidia driver API instead of the much easier to use C API). I know that
although NVidia says they fully support it, they don't like it too much.
NVidia techs told me that the performance difference can be about a factor of 2.

Cuda has been around for 5 years (plus another 2 internally at NVidia). Version
1 of OpenCL was released in December 2008, and work on 1.1 started immediately
after that. It has also been broken almost from the start because too many
companies control it (it's designed by a consortium) and it tries to solve the
problem for too many scenarios at the same time.

ATI has also started supporting OpenCL, but I don't have any experience with it.
The upside of their implementation is that it can also compile CPU versions.

I would start with Cuda: it is easier to begin with, and the move to OpenCL
afterwards is very simple if you wish to make it.

Also note that OpenCL gives you functional portability but not performance
portability. You will not write the same OpenCL code for NVidia, ATI, CPUs etc.
The vectorization should differ in each case (NVidia discourages vectorization,
ATI requires vectorization, SSE requires different vectorization), the memory
model is different, the size of the work groups should be different, etc.

> > Have you considered sharing access with another research lab that has
> > already purchased something similar?
> > (Some vendors may also be willing to let you run your codes in exchange
> > for feedback.)
> 
> There's nobody else at UC Berkeley I know of who has a GPU
> cluster.
> 
> I don't know of any vendor who'd be willing to volunteer
> their cluster. If anybody would like to volunteer, step
> right up.
> 

Are you aware of the NVidia professor partnership program? We got a Tesla S1070
for free from them.

http://www.nvidia.com/page/professor_partnership.html

> > 1) sw thread synchronization chews up processor time
> 
> Right, but let's say right now 80% of the CPU time is spent
> in routines that will eventually be done in the GPU (I'm
> just making this number up). I don't see how having a faster
> CPU would help overall.
>

My experience is that unless you wish to write hybrid code (code that partly
runs on the GPU and partly on the CPU in parallel to fully utilize the system)
you don't need to care too much about the CPU power.
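
A rough way to check this for your own code is Amdahl's law. The Python sketch
below is only an illustration: the 80% offload fraction is the made-up number
from the question above, and the 10x GPU speedup and 2x CPU speedup are equally
made-up placeholders. It assumes non-hybrid code (the CPU and GPU parts do not
overlap) and ignores host-to-device transfer time.

```python
def overall_speedup(offload_fraction, gpu_speedup, cpu_speedup=1.0):
    """Amdahl's-law estimate of the whole-program speedup.

    offload_fraction: share of the original run time moved to the GPU.
    gpu_speedup: how much faster that share runs on the GPU (assumed).
    cpu_speedup: how much faster the remaining share runs on a faster CPU.
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    cpu_time = (1.0 - offload_fraction) / cpu_speedup
    gpu_time = offload_fraction / gpu_speedup
    return 1.0 / (cpu_time + gpu_time)


if __name__ == "__main__":
    # All numbers are placeholders; plug in measurements from your own code.
    print(overall_speedup(0.8, gpu_speedup=10.0))
    print(overall_speedup(0.8, gpu_speedup=10.0, cpu_speedup=2.0))
```

How much the CPU matters in this estimate depends entirely on how large the
fraction left on the CPU turns out to be, so it is worth profiling first.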

Note that the Cuda model is asynchronous, so the GPU and the CPU can run code
in parallel.

> > 2) Do you already know if your code has enough computational complexity
> > to outweigh the memory access costs?
> 
> In general, yes. A couple of grad students have ported some
> of their code to CUDA with excellent results. Plus, molecular
> dynamics is well suited to GPU programming, or so I'm told.
> Several of the popular opensource MD packages have already
> been ported also with excellent results.
> 

The issue is not only computational complexity but also regular memory accesses.
Random memory accesses on the GPU can seriously kill your performance.

Also note that until Fermi comes out, double precision performance is
horrible. If you can't use single precision then GPUs are probably not for you
at the moment. Double precision on g200 runs at around 1/8 of single precision
speed, and g80/g90 don't have double precision at all.

Fermi improves that by finally providing double precision running at 1/2 the
single precision speed (basically combining two FPUs into one double precision
unit).
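
To make those ratios concrete, here is a small Python sketch. The ratios are
the ones quoted above; the function name and the 1000 GFLOPS single precision
peak are hypothetical placeholders, not the specification of any real card.

```python
# Double-to-single precision throughput ratios as quoted in this thread.
DP_RATIO = {
    "g80": 0.0,          # no double precision support
    "g90": 0.0,          # no double precision support
    "g200": 1.0 / 8.0,   # around 1/8 of single precision
    "fermi": 1.0 / 2.0,  # 1/2 of single precision
}


def double_precision_peak(single_precision_gflops, arch):
    """Scale a single precision peak by the architecture's DP ratio."""
    return single_precision_gflops * DP_RATIO[arch]


if __name__ == "__main__":
    # 1000.0 is a placeholder peak, chosen only to make the ratios visible.
    for arch in ("g80", "g200", "fermi"):
        print(arch, double_precision_peak(1000.0, arch))
```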

> > 3) Do you know if the GTX275 has enough vram? Your benchmarks will
> > suffer if you start going to gart and page faulting
> 

You don't have page faulting on the GPU; GPUs don't have virtual memory. If you
don't have enough memory the allocation will just fail.
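
Since the allocation either fits or fails, it is worth estimating the working
set up front. The Python sketch below assumes a hypothetical layout of three
per-atom arrays (say positions, velocities and forces) of four single precision
values each, and keeps an arbitrary 10% of the card in reserve. A real MD code
will also need neighbour lists and other buffers, so treat the result as a
lower bound.

```python
MIB = 1024 * 1024


def md_state_bytes(n_atoms, arrays=3, floats_per_atom=4, bytes_per_float=4):
    """Bytes needed for the per-atom arrays in the assumed layout."""
    return n_atoms * arrays * floats_per_atom * bytes_per_float


def fits_on_card(n_atoms, card_mib, reserve_fraction=0.1):
    """True if the per-atom arrays fit in the card's memory minus a reserve."""
    usable = card_mib * MIB * (1.0 - reserve_fraction)
    return md_state_bytes(n_atoms) <= usable


if __name__ == "__main__":
    # 898 is the card size in MB mentioned in this thread.
    print(md_state_bytes(1000000))
    print(fits_on_card(1000000, 898))
```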

> The one I mentioned in my posting has 1.8GB of RAM. If this isn't
> enough then we're in trouble. The grad student I mentioned
> has been using the 898MB version of this card without problems.
> 
> > 4) I can tell you 100% that not all gpu are created equally when it
> > comes to handling cuda code. I don't have experience with the GTX275,
> > but if you do hit issues I would be curious to hear about them.
> 
> I've heard that it's much better than the 9500GT that we first
> started using. Since the 9500GT is a much cheaper card we didn't expect
> much performance out of it, but the grad student who was trying
> to use it said that there were problems with it not releasing memory,
> resulting in having to reboot the host. I don't know the details.
> 

I don't have any issues with releasing memory. The big difference is between
the g80/g90 series (including the 9500GT), which implements the 1.1 Cuda model,
and the g200, which uses the 1.3 Cuda model.

Memory handling is much better on the 1.3 GPUs (the access patterns required to
fully utilize the memory bandwidth are much more lenient). The g200 also has
double precision support (although at about 1/8 the speed of single precision).
There is also more support for atomic operations and a few other differences,
although the biggest difference is the memory bandwidth utilization.
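
To illustrate what "more lenient" means, here is a simplified Python model of
how many memory transactions a half-warp of 16 threads needs under the two
models. It is my own reduction of the coalescing rules in the Cuda programming
guide, not an exact reproduction: it assumes all 16 threads are active, 4-byte
words, and ignores the transaction size reduction that the 1.3 hardware does.
The function names are made up for this sketch.

```python
HALF_WARP = 16


def transactions_cc11(addresses, word_bytes=4):
    """Simplified 1.0/1.1 rule: a single transaction only if thread k reads
    word k of a segment aligned to 16 words; otherwise one per thread."""
    segment_bytes = HALF_WARP * word_bytes
    base = addresses[0]
    aligned = base % segment_bytes == 0
    in_order = all(a == base + k * word_bytes for k, a in enumerate(addresses))
    return 1 if (aligned and in_order) else len(addresses)


def transactions_cc13(addresses, segment_bytes=128):
    """Simplified 1.2/1.3 rule: one transaction per distinct segment touched."""
    return len({a // segment_bytes for a in addresses})


if __name__ == "__main__":
    patterns = {
        "aligned": [4 * k for k in range(HALF_WARP)],
        "shifted by one word": [4 * (k + 1) for k in range(HALF_WARP)],
        "scattered": [4096 * k for k in range(HALF_WARP)],
    }
    for name, addrs in patterns.items():
        print(name, transactions_cc11(addrs), transactions_cc13(addrs))
```

In this model a one-word misalignment costs 16 transactions under the 1.1 rule
but still only one under the 1.3 rule, while truly scattered accesses are
expensive under both.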

Don't bother with the 8000 and 9000 series for HPC with Cuda. They are cheaper
for learning but not well suited to deployment.

> > Some questions in return..
> > Is your code currently C, C++ or Fortran?
> 
> The most important program for this group is in Fortran.
> We're going to keep it in Fortran, but we're going to
> write C interfaces to the routines that will run on
> the GPU, and then write these routines in C.
> 

You may want to look into the PGI compiler. They introduced Cuda support for
Fortran, I believe back in November.
http://www.pgroup.com/resources/cudafortran.htm

> > Is there any interest in optimizations at the compiler level which could
> > benefit molecular dynamics simulations?
> 
> Of course, but at what price? I'm talking about
> both the price in dollars and the price in non-standard
> directives.
> 
> I'm not a chemist so I don't know what would speed up MD calculations
> more than a good GPU.
> 

On the CPU side you can utilize SSE. You can also use single precision on the
CPU along with SSE and good cache utilization to greatly speed things up there
as well.

My personal experience, though, is that it's much harder to apply such
optimizations on the CPU than on the GPU for most problems.

> Cordially,