[Beowulf] GPU question

Micha Feigin michf at post.tau.ac.il
Tue Sep 1 01:58:40 PDT 2009

On Mon, 31 Aug 2009 12:28:43 -0400
Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi Amjad
> 1. Beware of hardware requirements, especially on your existing
> computers, which may or may not fit a CUDA-ready GPU.
> Otherwise you may end up with a useless lemon.
> A) Not all NVidia graphic cards are CUDA-ready.
> NVidia has lists telling which GPUs are CUDA-ready,
> which are not.

All newer cards are CUDA-ready, but with different levels of support.

Basically everything from the GeForce 8000 series and up supports CUDA (even
the $40 cards, by the way).

G80/G90 cards (GeForce 8000 and 9000 series) only do single precision, have
relatively small memory (256-768 MB), have stricter coalescing requirements
(working efficiently with main card memory is harder), and have limited atomic
operation support.
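
To make the coalescing point concrete, here is a sketch (my own illustration,
not from the original post; kernel names are made up). On G80/G90, a half-warp
of 16 threads must read a contiguous, aligned segment of memory for the loads
to be combined into one transaction; G200 relaxes these rules considerably:

```cuda
// Coalesced: thread i reads element i, so consecutive threads touch
// consecutive addresses and the hardware merges the loads.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: a stride scatters the half-warp's addresses across memory,
// so on G80/G90 each load becomes a separate memory transaction.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```

Both kernels do the same work; the second can be several times slower on
G80/G90 purely because of the access pattern.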

G200 cards (GeForce 200 series) do double precision, can reach a lot more
memory (1 GB on the GTX 285, 1800 MB on the GTX 295), have more cores and
higher memory bandwidth, and have much better atomic operation support. The
GTX 285 has 240 cores; the GTX 295 has two GPUs on board, giving 480 cores.
These are rather nice for development if you want to work on the main card in
your PC and have a low budget. They are not targeted at HPC by NVidia, by the
way.

Tesla: this is basically the same as the GTX 285 (240 cores) but has 4 GB of
memory and no graphics output. You can get four of these in a dedicated
rackmount (the S1070). They are designed to run CUDA, although unofficially
they also support OpenGL, according to the person in charge of them at NVidia.
They are much more expensive than the GTX 285, though.

Quadro: these are designed for professional graphics: render farms, CAD, and
high dynamic range work. They support high dynamic range imaging and
anti-aliasing in hardware, and they are the only NVidia cards that support GPU
affinity for OpenGL (for GLSL). There are unofficial hacks to achieve this
under Linux with the G200, if I'm not mistaken, but no way to do it under
Windows. There is also a rackmount to connect four of these to one PC (the
Quadro Plex).

Another thing to note is that the Teslas and Quadros are manufactured and
supported by NVidia and designed for 24/7 deployment. For the GeForce series,
NVidia makes the GPU chip but not the cards, and they get very angry and
offended if you suggest putting one of these into a deployment system.

> B) Check all the GPU hardware requirements in detail: motherboard,
> PCIe version and slot, power supply capacity and connectors, etc.
> See the various GPU models on NVidia site, and
> the product specs from the specific vendor you choose.
> C) You need a free PCIe slot, most likely 16x, IIRC.

I couldn't find any CUDA-supported card that works on anything less than PCIe
x16. I found a reference to one of the cheap cards (I don't remember which
one, I think the GeForce 8400) that supposedly has a version for a narrower
slot, but I couldn't actually find one.

> D) Most GPU card models are quite thick, and take up its
> own PCIe slot and cover the neighbor slot, which cannot be used.
> Hence, if your motherboard is already crowded, make sure
> everything will fit.

The high-end ones take two slots, one of them for the cooling; I think the GTX
295 actually takes three slots, if memory serves.

> For rackmount a chassis you may need at least 2U height.
> On a tower PC chassis this shouldn't be a problem.
> You may need some type of riser card if you plan to mount the GPU
> parallel to the motherboard.

You also need appropriate cooling to take care of the ridiculous amount of heat.

> E) If I remember right, you need PCIe version 1.5 (?)
> or version 2 on your motherboard.

Most cards are PCIe 2.0.

> F) You also need a power supply with enough extra power to feed
> the GPU beast.
> The GPU model specs should tell you how much power you need.
> Most likely a 600W PS or larger, especially if you have a dual socket
> server motherboard with lots of memory, disks, etc to feed.

They also take their own power input from the PSU (two to three connectors)
for additional power, unlike cards that draw everything from the PCIe slot, so
you need a PSU with the right connectors. The PSU also needs to be strong
enough (around 200 W per card; the GTX 285 spec says you need at least a 550 W
PSU for a single card).

> G) Depending on the CUDA-ready GPU card,
> the low end ones require 6-pin PCIe power connectors
> from the power supply.
> The higher end models require 8-pin power supply PCIe connectors.
> You may find and buy molex-to-PCIe connector adapters also,
> so that you can use the molex (i.e. ATA disk power connectors)
> if your PS doesn't have the PCIe connectors.
> However, you need to have enough power to feed the GPU and the system,
> no matter what.
> ***
> 2. Before buying a lot of hardware, I would experiment first with a
> single GPU on a standalone PC or server (that fits the HW requirements),
> to check how much programming it takes,
> and what performance boost you can extract from CUDA/GPU.
> CUDA requires  quite a bit of logistics of
> shipping data between memory, GPU, CPU,
> etc.
> It is perhaps more challenging to program than, say,
> parallelizing a serial program with MPI, for instance.
> Codes that are heavy in FFTs or linear algebra operations are probably
> good candidates, as there are CUDA libraries for both.

There is a steep learning curve, as you need to understand the hardware to get
the most out of your code. I find it easier to code than MPI, but I guess that
is personal. Your code is usually a good fit for CUDA if you need to do the
same thing a lot of times. If your code has a lot of logic (a lot of if
clauses), a lot of atomic operations, or complex data structures, then it
probably won't transfer well. Complex (non-ordered) memory access patterns can
also carry a significant performance hit (reads are easier to cope with than
writes).
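
The "logistics of shipping data" Gus mentions, and the "same thing a lot of
times" pattern, look roughly like this minimal sketch (my own illustration
using the CUDA runtime API; error checking omitted for brevity):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// "Same thing a lot of times": every thread applies the identical
// operation to its own element (SAXPY, y = a*x + y).
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *x = (float *)malloc(bytes), *y = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // The logistics: allocate on the card, copy in, launch, copy out.
    float *dx, *dy;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);   /* 3*1 + 2 = 5 */

    cudaFree(dx); cudaFree(dy); free(x); free(y);
    return 0;
}
```

Even in this toy case, roughly half the program is data movement rather than
computation, which is why codes dominated by transfers rarely see a speedup.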

> At some point only 32-bit floating point arrays would take advantage of
> CUDA/GPU, but not 64-bit arrays.
> The latter would
> require additional programming to change between 64/32 bit
> when going to and coming back from the GPU.
> Not sure if this still holds true,
> newer GPU models may have efficient 64-bit capability,
> but it is worth checking this out, including if performance for
> 64-bit is as good as for 32-bit.

G200 and up (including Tesla and Quadro) have 64-bit floating point support,
but it is much less efficient than 32-bit; if memory serves, the ratio is
about 1:5. What NVidia calls cores are actually FPUs; these FPUs are 32-bit,
and the hardware combines them somehow to get 64-bit arithmetic. ATI is better
at 64-bit arithmetic performance, but ATI Stream is much more limited, much
harder to code for, and the documentation is VERY scarce. If you use GLSL and
double precision, it's better to go with ATI. Maybe once OpenCL is mature
enough it will also be an option, but the G300 will probably be on the market
by then, and that may change the balance.

The GeForce 8000 and 9000 series only do single precision.
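
One practical gotcha worth knowing (a sketch from my experience; check the
docs for your toolkit version): nvcc only emits real double-precision code
when you target compute capability 1.3, i.e. G200-class hardware. Otherwise
doubles are silently demoted to floats:

```cuda
// Compile with:
//   nvcc -arch=sm_13 kernel.cu   // real double precision (G200/Tesla)
//   nvcc kernel.cu               // default target: doubles demoted to float
__global__ void scale(double *v, double s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= s;   // several times slower than the float version on G200
}
```

So a code can appear to "work" in double precision on older targets while
actually computing in single precision.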

> 3. PGI compilers version 9 came out with "GPU directives/pragmas"
> that are akin to the OpenMPI directives/pragmas,
> and may simplify the use of CUDA/GPU.
> At least before the promised OpenCL comes out.
> Check the PGI web site.
> Note that this will give you intra-node parallelism exploring the GPU,
> just like OpenMP does using threads on the CPU/cores.

I saw one of these; I don't remember if it was PGI. No experience with it,
though. From CUDA experience, however, there are a lot of things that would be
hard for such a compiler to achieve.
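
For context, the PGI 9 accelerator directives mentioned above look roughly
like this (syntax per the PGI Accelerator programming model as I recall it;
treat this as a sketch, since details vary by compiler version):

```c
/* The compiler generates the GPU kernel and the host<->device copies
   from the annotated loop; no explicit CUDA code is written. */
void scale(int n, float a, float *restrict x)
{
    #pragma acc region
    {
        for (int i = 0; i < n; ++i)
            x[i] = a * x[i];
    }
}
```

The convenience is real, but the compiler has to guess things like data
placement and coalescing that a hand-written CUDA kernel controls directly,
which is where I would expect it to fall short.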

BTW, there is also a MATLAB toolbox called Jacket, from AccelerEyes, that lets
you do CUDA from MATLAB. The numbers are not as good as they advertise (on
truth in advertising: they also provide OpenGL visualization functions, and in
their benchmarks their own code uses the OpenGL path while the MATLAB version
uses surf; from testing, the difference is huge).

> 4. CUDA + MPI may be quite a challenge to program.
> I hope this helps,
> Gus Correa
> amjad ali wrote:
> > Hello all, specially Gil Brandao
> > 
> > Actually I want to start CUDA programming for my |C.I have 2 options to do:
> > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs.
> > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster.
> > 
> > Which one is more "natural" and "practical" way?
> > Does a program written for any one of the above will work fine on the 
> > other? or we have to re-program for the other?
> > 
> > Regards.
> > 
> > On Sat, Aug 29, 2009 at 5:48 PM, <madskaddie at gmail.com 
> > <mailto:madskaddie at gmail.com>> wrote:
> > 
> >     On Sat, Aug 29, 2009 at 8:42 AM, amjad ali<amjad11 at gmail.com
> >     <mailto:amjad11 at gmail.com>> wrote:
> >      > Hello All,
> >      >
> >      >
> >      >
> >      > I perceive following computing setups for GP-GPUs,
> >      >
> >      >
> >      >
> >      > 1)      ONE PC with ONE CPU and ONE GPU,
> >      >
> >      > 2)      ONE PC with more than one CPUs and ONE GPU
> >      >
> >      > 3)      ONE PC with one CPU and more than ONE GPUs
> >      >
> >      > 4)      ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than
> >     ONE GPUs
> >      > (e.g. Nvidia C1060)
> >      >
> >      > 5)      Cluster of PCs with each node having ONE CPU and ONE GPU
> >      >
> >      > 6)      Cluster of PCs with each node having more than one CPUs
> >     and ONE GPU
> >      >
> >      > 7)      Cluster of PCs with each node having ONE CPU and more
> >     than ONE GPUs
> >      >
> >      > 8)      Cluster of PCs with each node having more than one CPUs
> >     and more
> >      > than ONE GPUs.
> >      >
> >      >
> >      >
> >      > Which of these are good/realistic/practical; which are not? Which
> >     are quite
> >      > “natural” to use for CUDA based programs?
> >      >
> > 
> >     CUDA is kind of a new technology, so I don't think there is a "natural
> >     use" yet, though I read that there are people doing CUDA+MPI and there
> >     are papers on CPU+GPU algorithms.
> > 
> >      >
> >      > IMPORTANT QUESTION: Will a cuda based program will be equally
> >     good for
> >      > some/all of these setups or we need to write different CUDA based
> >     programs
> >      > for each of these setups to get good efficiency?
> >      >
> > 
> >     There is no "one size fits all" answer to your question. If you never
> >     developed with CUDA, buy one GPU and try it. If it fits your problems,
> >     scale it with the approach that makes you more comfortable (but
> >     remember that scaling means: making bigger problems or having more
> >     users). If you want a rule of thumb: your code must be
> >     _truly_parallel_. If you are buying for someone else, remember that
> >     this is a niche. The whole thing is just starting; I don't think
> >     there are many people who need more than 1 or 2 GPUs.
> > 
> >      >
> >      > Comments are welcome also for AMD/ATI FireStream.
> >      >
> > 
> >     put it on hold until OpenCL takes off (in the real sense, not in the
> >     "standards papers" sense), otherwise you will have to learn another
> >     technology that even fewer people know.
> > 
> > 
> >     Gil Brandao
> > 
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
