[Beowulf] GPU Beowulf Clusters

Sun Jan 31 10:15:12 PST 2010

Micha Feigin wrote:
> On Sat, 30 Jan 2010 17:30:31 -0800
> Jon Forrest <jlforrest at berkeley.edu> wrote:
>
>   
>> On 1/30/2010 2:52 PM, "C. Bergström" wrote:
>>
>>     
>>> Hi Jon,
>>>
>>> I must emphasize what David Mathog said about the importance of the gpu
>>> programming model.
>>>       
>> I don't doubt this at all. Fortunately, we have lots
>> of very smart people here at UC Berkeley. I have
>> the utmost confidence that they will figure this
>> stuff out. My job is to purchase and configure the
>> cluster.
>>
>>     
>>> My perspective (with hopefully not too much opinion added)
>>> OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks features, is more
>>> tedious to write and in an effort to stay generic loses the potential to
>>> fully exploit the gpu. At one point the performance of the drivers from
>>> Nvidia was not equivalent, but I think that's been fixed. (This does not
>>> mean all vendors are unilaterally doing a good job)
>>>       
>> This is very interesting news. As far as I know, nobody is doing
>> anything with OpenCL in the College of Chemistry around here.
>> On the other hand, we've been following all the press about how
>> it's going to be the great unifier so that it won't be necessary
>> to use a proprietary API such as CUDA anymore. At this point it's too
>> early to do anything with OpenCL until our colleagues in
>> the Computer Science department have made a pass at it and
>> have experiences to talk about.
>>
>>     
>
> People are starting to work with OpenCL but I don't think that it's ready yet.
> The nvidia implementation is still buggy and not on par with cuda in
> terms of performance. Code is longer and more tedious (mostly matches the
> nvidia driver model instead of the much easier to use c api). I know that
> although NVidia say that they fully support it, they don't like it too much.
> NVidia techs told me that the performance difference can be about 1:2.
>   
That used to be true, but I thought they fixed that?  (How old is your 
information?)
> Cuda has existed for 5 years (and another 2 internally in NVidia). Version 1 of
> OpenCL was released December 2008 and they started working on 1.1 immediately
> after that. It has also been broken almost from the start due to too many
> companies controlling it (it's designed by a consortium) and trying to solve the
> problem for too many scenarios at the same time.
>   
The problem isn't too many companies.. It was IBM's cell requirements 
afaik.. Thank god that's dead now..
> ATI also started supporting OpenCL but I don't have any experience with that.
> Their upside is that it also allows compiling cpu versions.
>
> I would start with cuda as the move to OpenCL is very simple afterwards if you
> wish and Cuda is easier to start with.
>   
I would start with a directive based approach that's entirely more sane 
than CUDA or OpenCL.. Especially if his code is primarily Fortran.  I 
think writing C interfaces so that you can call the GPU is a maintenance 
nightmare and will not only be time consuming, but will later make 
optimizing the application *a lot* harder.  (I say this with my gpu 
compiler hat on and am more than happy to go into specifics)
> Also note that OpenCL gives you functional portability but not performance
> portability. You will not write the same OpenCL code for NVidia, ATI, CPUs etc.
> The vectorization should be all different (NVidia discourages vectorization, ATI
> requires vectorization, SSE requires different vectorization), the memory model
> is different, the size of the work groups should be different, etc.
>   
Please look at HMPP and see if it may solve this..
>   
>>> Have you considered sharing access with another research lab that has
>>> already purchased something similar?
>>> (Some vendors may also be willing to let you run your codes in exchange
>>> for feedback.)
>>>       
>> There's nobody else at UC Berkeley I know of who has a GPU
>> cluster.
>>
>> I don't know of any vendor who'd be willing to volunteer
>> their cluster. If anybody would like to volunteer, step
>> right up.
>>
>>     
>
> Are you aware of the NVidia professor partnership program? We got a Tesla S1070
> for free from them.
>
> http://www.nvidia.com/page/professor_partnership.html
>
>   
>>> 1) sw thread synchronization chews up processor time
>>>       
>> Right, but let's say right now 80% of the CPU time is spent
>> in routines that will eventually be done in the GPU (I'm
>> just making this number up). I don't see how having a faster
>> CPU would help overall.
>>
>>     
>
> My experience is that unless you wish to write hybrid code (code that partly
> runs on the GPU and partly on the CPU in parallel to fully utilize the system)
> you don't need to care too much about the CPU power.
>
> Note that the Cuda model is asynchronous so you can run code in parallel
> between the GPU and CPU.
>  
>   
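As a rough illustration of the CPU question above, here is a minimal Amdahl-style sketch in C. It is an estimate under stated assumptions, not a measurement: the 80% figure is the made-up number from the thread, the function name `offload_speedup` and its parameters are hypothetical, and the model assumes no CPU/GPU overlap and ignores host-device transfer cost.

```c
#include <assert.h>

/*
 * Amdahl-style estimate (illustrative only).
 *   p            fraction of baseline runtime in routines moved to the GPU
 *   gpu_speedup  assumed speedup of those routines on the GPU
 *   cpu_speedup  assumed speedup of the code that stays on the CPU
 * Returns the overall speedup relative to the baseline run.
 * Assumes no CPU/GPU overlap and no transfer cost.
 */
double offload_speedup(double p, double gpu_speedup, double cpu_speedup)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(gpu_speedup > 0.0 && cpu_speedup > 0.0);

    double cpu_time = (1.0 - p) / cpu_speedup;  /* part that stays on the CPU */
    double gpu_time = p / gpu_speedup;          /* part offloaded to the GPU  */
    return 1.0 / (cpu_time + gpu_time);
}
```

With p = 0.8 and an assumed 10x GPU speedup, `offload_speedup(0.8, 10.0, 1.0)` comes out at roughly 3.6x. Doubling the CPU speed in the same model, `offload_speedup(0.8, 10.0, 2.0)`, gives roughly 5.6x, so in this simple model the CPU-only remainder starts to dominate once the GPU part is fast.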
>>> 2) Do you already know if your code has enough computational complexity
>>> to outweigh the memory access costs?
>>>       
>> In general, yes. A couple of grad students have ported some
>> of their code to CUDA with excellent results. Plus, molecular
>> dynamics is well suited to GPU programming, or so I'm told.
>> Several of the popular opensource MD packages have already
>> been ported also with excellent results.
>>
>>     
>
> The issue is not only computational complexity but also regular memory accesses.
> Random memory accesses on the GPU can seriously kill your performance.
>   
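To make the point about regular memory accesses concrete, here is a small host-side sketch in plain C (not GPU code). It counts how many memory segments a group of threads touches when thread t reads element t * stride. The function name `count_segments` and the group and segment sizes used below are assumptions for illustration; the real coalescing rules depend on the GPU generation and are more detailed than this model.

```c
#include <assert.h>

/*
 * Count the distinct memory segments touched when `threads` threads each
 * read one element, thread t reading element t * stride of an array that
 * starts on a segment boundary. Simplified model, not actual hardware rules.
 */
int count_segments(int threads, int stride, int elem_bytes, int segment_bytes)
{
    assert(threads > 0 && stride > 0 && elem_bytes > 0 && segment_bytes > 0);

    int count = 0;
    long last = -1;
    for (int t = 0; t < threads; ++t) {
        long seg = ((long)t * stride * elem_bytes) / segment_bytes;
        if (seg != last) {  /* offsets grow with t, so a segment is never revisited */
            ++count;
            last = seg;
        }
    }
    return count;
}
```

For an assumed group of 16 threads reading 4-byte floats with 64-byte segments, a stride of 1 touches 1 segment, a stride of 2 touches 2, and a stride of 16 touches 16, one per thread. A scattered access pattern behaves like the last case, which is why it costs so much bandwidth.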
I think I mentioned memory accesses.. Are you talking about page faults 
or what specifically?  (My perspective is skewed and I may be using a 
different term.) 
> Also note that until fermi comes out the double precision performance is
> horrible. If you can't use single precision then GPUs are probably not for you
> at the moment. Double precision on g200 is around 1/8 of single precision
> and g80/g90 don't have double precision at all.
>
> Fermi improves that by finally providing double precision running at 1/2 the
> single precision speed (basically combining two FPUs into one double precision
> unit).
>
>   
>>> 3) Do you know if the GTX275 has enough vram? Your benchmarks will
>>> suffer if you start going to gart and page faulting
>>>       
>
> You don't have page faulting on the GPU, GPUs don't have virtual memory. If you
> don't have enough memory the allocation will just fail.
>   
Whatever you want to label it at a hardware level nvidia cards *do* have 
vram and the drivers *can* swap to system memory.  They use two things 
to deal with this: a) a hw based page fault mechanism and b) dma copying to 
reduce cpu overhead.  If you try to allocate more than is available on 
the card yes it will probably just fail.  (We are working on the 
drivers)  My point was about what happens between the context switches 
of kernels.
>   
>> The one I mentioned in my posting has 1.8GB of RAM. If this isn't
>> enough then we're in trouble. The grad student I mentioned
>> has been using the 898MB version of this card without problems.
>>
>>     
>>> 4) I can tell you 100% that not all gpu are created equally when it
>>> comes to handling cuda code. I don't have experience with the GTX275,
>>> but if you do hit issues I would be curious to hear about them.
>>>       
>> I've heard that it's much better than the 9500GT that we first
>> started using. Since the 9500GT is a much cheaper card we didn't expect
>> much performance out of it, but the grad student who was trying
>> to use it said that there were problems with it not releasing memory,
>> resulting in having to reboot the host. I don't know the details.
>>
>>     
>
> I don't have any issues with releasing memory. The big differences are between
> the g80/g90 series (including the 9500GT) which is a 1.1 Cuda model and the
> g200 which uses the 1.3 cuda model.
>
> Memory handling is much better on the 1.3 GPUs (memory accesses for fully
> utilizing the memory bandwidth are much more lenient). The g200 also has double
> precision support (although at about 1/8 the speed of single precision). There
> is also more support for atomic operations and a few other differences,
> although the biggest difference is the memory bandwidth utilization.
>
> Don't bother with the 8000 and 9000 for HPC and Cuda. Cheaper for learning but
> not so much for deployment.
>
>   
>>> Some questions in return..
>>> Is your code currently C, C++ or Fortran?
>>>       
>> The most important program for this group is in Fortran.
>> We're going to keep it in Fortran, but we're going to
>> write C interfaces to the routines that will run on
>> the GPU, and then write these routines in C.
>>
>>     
>
> You may want to look into the pgi compiler. They introduced Cuda support for
> Fortran, I believe since November.
> http://www.pgroup.com/resources/cudafortran.htm
>   
Can anyone give positive feedback?  (Disclaimer:  I'm biased, but since 
we are making specific recommendations)
http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36
>   
>>> Is there any interest in optimizations at the compiler level which could
>>> benefit molecular dynamics simulations?
>>>       
>> Of course, but at what price? I'm talking about both
>> the price in dollars and the price in non-standard
>> directives.
>>
>> I'm not a chemist so I don't know what would speed up MD calculations
>> more than a good GPU.
>>
>>     
>
> On the cpu side you can utilize SSE. Using single precision on the CPU
> together with SSE and good cache utilization can also speed things up
> greatly.
>
> My personal experience though is that it's much harder to use such optimization
> on the CPU than on the GPU for most problems.
>   
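For reference, the single-precision SSE approach mentioned above might look like the following minimal sketch in C. The routine `saxpy_f32` is a hypothetical example, not code from any MD package; it uses the standard `xmmintrin.h` intrinsics when the compiler defines `__SSE__` and falls back to a scalar loop otherwise.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* y[i] = a * x[i] + y[i] in single precision, 4 floats per SSE operation. */
void saxpy_f32(float a, const float *x, float *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));

    size_t i = 0;
#if defined(__SSE__)
    __m128 va = _mm_set1_ps(a);               /* broadcast a into all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);      /* unaligned loads, no alignment assumed */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
#endif
    for (; i < n; ++i)                        /* scalar tail, or whole loop without SSE */
        y[i] = a * x[i] + y[i];
}
```

Even this small case needs explicit loads, stores and a tail loop by hand, which gives some feel for the effort being compared in this part of the thread.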
CUDA/OpenCL and friends implicitly identify which areas can be 
vectorized and then explicitly offload them.  You are comparing 
apples and oranges here..