[Beowulf] GP-GPU experience
diep at xs4all.nl
Mon Apr 4 13:07:31 PDT 2011
On Apr 4, 2011, at 6:54 PM, Massimiliano Fatica wrote:
> If you are old enough to remember the time when the first distributed
> computers appeared on the scene,
> this is a deja-vu. Developers used to program on shared memory (
> mostly with directives) were complaining
> about the new programming models ( PVM, MPL, MPI).
> Even today, if you have a serial code there is no tool that will make
> your code run on a cluster.
> Even on a single system, if you try an auto-parallel/auto-vectorizing
> compiler on a real code, your results will probably be disappointing.
> When you can get a 10x boost on a production code rewriting some
> portions of your code to use the GPU, if time to solution is important
Oh come on, a factor of 10 is not realistic.
You're making the usual comparison here: a hobby coder who coded
a bit in C or slowish C++ (except for a SINGLE, so not several,
NCSA coder, I have yet to find the first C++ guy who can write code
as fast as C for complex algorithms; granted, for big companies
C++ makes more sense, just not when it's about performance),
compared with a fully sponsored project in CUDA
that uses the top-end GPU, and measured against a single core
instead of 4 sockets (as that's the same power-wise).
Moneywise of course is another issue, that's where the gpu's win it
Yet there is a hidden cost with GPUs: you can build something way
faster for less money with GPUs, but you also need to pay a good
coder to write your code in either CUDA or AMD-CAL (or, as the Chinese
seem to do, support both at the same time, which is not so complicated
if you have set things up in the correct manner).
This last point is a big problem for the western world: governments pay big
bucks for hardware, but they seem to forget to pay good coders what they
are worth.
Secondly there is another problem: NVIDIA hasn't even documented
the instruction set of their GPUs. Try to figure that out without
working on it full-time.
It does seem pretty similar to AMD's, however. Despite the huge architectural
differences between the two, the programming similarity is striking, as is
the real purpose they were designed for (GRAPHICS).
> or you could perform simulations that were impossible before ( for
> example using algorithms that were just too slow on CPUs,
All true yet it takes a LOT OF TIME to write something that's fast on
First of all you have to avoid writing double precision code, as the
from Nvidia seem not to have much double precision logic; they only have
32-bit logic.
So at double precision, AMD is something like 10 times cheaper per
gflop than Nvidia.
Yet try to figure that out without being busy full-time with those GPUs.
Only the TESLA versions have those transistors, it seems.
Secondly Nvidia seems to keep being busy maximizing the frequency of
Now that might be GREAT for games as high clocked cores work (see
yet for throughput of course that's a dead end. In raw throughput
approach will always win it of course from nvidia, as clocking a
higher has an O(n^3) impact on power consumption.
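
To put rough numbers on the cubic claim, here is a minimal sketch. It assumes the textbook approximation that dynamic power goes as C * V^2 * f and that voltage has to rise roughly in proportion to clock, so power goes as f^3. The function names and the perfect-scaling assumption are illustrative only; nothing here is measured on any Nvidia or AMD part.

```python
# Rough model, NOT vendor data: power ~ cores * clock^3 (assumes voltage
# must scale roughly with clock), throughput ~ cores * clock (assumes
# perfect scaling). All values are relative to a baseline chip.

def relative_power(clock_ratio, core_ratio=1.0):
    """Power relative to the baseline under the cubic assumption."""
    return core_ratio * clock_ratio ** 3


def relative_throughput(clock_ratio, core_ratio=1.0):
    """Throughput relative to the baseline, assuming perfect scaling."""
    return core_ratio * clock_ratio


def throughput_per_watt(clock_ratio, core_ratio=1.0):
    """Relative throughput divided by relative power."""
    return (relative_throughput(clock_ratio, core_ratio)
            / relative_power(clock_ratio, core_ratio))


if __name__ == "__main__":
    for label, clock, cores in [("2x clock", 2.0, 1.0), ("2x cores", 1.0, 2.0)]:
        print(label,
              "power:", relative_power(clock, cores),
              "throughput:", relative_throughput(clock, cores),
              "throughput/watt:", throughput_per_watt(clock, cores))
```

Under these assumptions, doubling the clock costs 8x the power for 2x the throughput, while doubling the number of cores at the same clock costs 2x the power for the same 2x throughput.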
Another big problem with Nvidia is that they basically go over spec.
I haven't fully figured it out, yet it seems PCI-e was designed with
a maximum of 300 watts in mind.
Yet with the code I'm busy with, the CUDA version of it (mfaktc)
consumes a whopping 400+ watts,
and please realize that the majority of the time the system is only keeping
the stream cores busy,
not the caches at all and not much of the RAM.
It is of course only doing multiplications at full speed in 32-bit
code, using the new Fermi
instructions that allow multiplying 32 bits x 32 bits == 64 bits.
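
For readers who have not met a widening multiply, a minimal sketch of what "32 bits x 32 bits == 64 bits" means arithmetically. This is plain Python, not GPU code, and the function name is made up for illustration. In CUDA C the high half of such a product is available through the `__umulhi` intrinsic; which instructions mfaktc actually uses is not something this sketch claims to know.

```python
MASK32 = 0xFFFFFFFF  # largest unsigned 32-bit value


def mul_wide_u32(a, b):
    """Return (hi, lo): the two 32-bit halves of the 64-bit product a * b.

    Hypothetical helper for illustration; both inputs must fit in 32 bits.
    """
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("operands must be unsigned 32-bit values")
    product = a * b              # fits in 64 bits for 32-bit operands
    return product >> 32, product & MASK32


if __name__ == "__main__":
    hi, lo = mul_wide_u32(0xFFFFFFFF, 0xFFFFFFFF)
    print(hex(hi), hex(lo))
```

Without such an instruction the same product has to be assembled from several narrower multiplies plus carries, which is why having it in hardware matters for integer-heavy code.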
CUDA version of your code gets developed btw by a guy working for a
which, i guess, also sells those Tesla's.
So any performance bragging must keep in mind that it runs more than 33%
over spec in terms of power consumption.
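
The 33% figure follows directly from the two numbers quoted above (400 watts drawn against a 300 watt design maximum). A minimal sketch of the arithmetic, with a made-up helper name:

```python
def percent_over_spec(measured_watts, spec_watts):
    """How far the measured draw exceeds the spec, as a percentage of spec."""
    return (measured_watts - spec_watts) / spec_watts * 100.0


if __name__ == "__main__":
    # 400 W and 300 W are the figures from the post; "400+" means the
    # real overshoot is at least this large.
    print(round(percent_over_spec(400.0, 300.0), 1))
```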
Note AMD seems to follow nvidia in its path there.
> Discontinuous Galerkin method is a perfect example), there are a lot
> of developers that will write the code.
Oh come on, writing for GPUs is really complicated.
> The effort is clearly dependent on the code, the programmer and the
> tool used ( you can go from fully custom GPU code with CUDA or OpenCL,
Forget OpenCL, it's not good enough.
Better to code something in CUDA and AMD-CAL at the same time.
> to automatically generated CUF kernels from PGI, to directives using
> HMPP or PGI Accelerator).
> In situations where time to solution relates to money, for example
> oil and gas, GPUs are the answer today ( you will be surprised
> by the number of GPUs in Houston).
Pardon me, those industries were already using vectorized solutions long
before CUDA was there, and have of course been using GPUs massively to
calculate as soon as
a version that was programmable.
This is not new. All those industries will of course never say
anything about the performance or how many they use.
> Look at the performance and scaling of AMBER ( MPI+ CUDA),
> http://ambermd.org/gpus/benchmarks.htm, and tell me that the results
> were not worth the effort.
> Is GPU programming for everyone: probably not, in the same measure
> that parallel programming is not for everyone.
> Better tools will lower the threshold, but a threshold will be
> always present.
I would argue that both AMD and Nvidia have really tried to
give the 3rd world nations an advantage
by stopping progress in the rich nations.
I will explain. The real big advantage of rich nations is that
the average person has more cash.
Students are a good example there. They can afford GPUs easily.
Yet there is so little technical information available on latencies,
and in the case of Nvidia on the instruction set that
the GPUs support, that this creates a huge programming hurdle for
Also there are no good tips in the Nvidia documents on how to program for
The most fundamental lessons on how to program a GPU are missing from all
the documents I have scanned so far.
It's just a bunch of 'lectures' that's not going to create any
A piece of information here and a tad there.
AMD is also a nightmare there: they can't even run more than 1
program at the same time, despite claims
that the 4000 series GPUs already had hardware support to do it. The
indian helpdesk in fact is so lazy that
they didn't even rename the word 'ati' in the documentation to AMD,
and the library each few months gets a
new name. Stream SDK now it's another new fancy name. "we worked hard
in India sahib, yes sahib, yes sahib".
Yet 5 years later still not much works. For example, in OpenCL
the 2nd GPU also doesn't work in the case of AMD.
Result: "undefined". Nice.
In fact, the default driver install on Linux here doesn't get OpenCL to
work on the 6970.
Both nvidia as well as AMD are a total joke there and by means of
the generic incompetence being complete and clear documentation just
like we have documentation on how
CPUs work, be it Intel or AMD or IBM.
Students who now program for those GPUs in CUDA or AMD-CAL
will have to go to hell and back to get something
to work well on them, except for some trivial stuff that works well anyway.
We see that just a few manage.
That's not a problem of the students, but a problem for society,
because doing calculations faster, and especially
CHEAP, is a huge advantage for the progress of science.
NSA-type organisations in 3rd world nations are a lot bigger than
here, simply because more people live there.
So right now more people over there code for GPUs than here,
where everyone can afford one.
Some big companies excepted of course, but this is not a small note
on companies. This is a note on 1st world versus 3rd
world. The real difference is students with a budget over here.
They have the budget for GPUs, yet there is no good documentation simply
stating which instructions a GPU has, let alone which
If you google hard, you will find 1 guy who actually had to measure
the latencies of simple
instructions that write to the same register. Why did a university
guy need to measure this? Why isn't this simply
in the Nvidia documentation?
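
For context, this kind of latency measurement is usually done by timing a long chain of instructions where each one depends on the previous result, then dividing out the chain length. A minimal sketch of only the arithmetic involved, assuming you already have two timings from chains of different length; the function name is hypothetical and the numbers in the usage example are invented, not measurements of any GPU.

```python
def cycles_per_instruction(t_short_s, t_long_s, n_short, n_long, clock_hz):
    """Estimate the latency, in clock cycles, of one dependent instruction.

    Timing two chain lengths and taking the difference cancels the fixed
    overhead (kernel launch, loop setup) that is present in both runs.
    """
    if n_long <= n_short:
        raise ValueError("n_long must be greater than n_short")
    seconds_per_instruction = (t_long_s - t_short_s) / (n_long - n_short)
    return seconds_per_instruction * clock_hz


if __name__ == "__main__":
    # Invented example: 1 GHz clock, 5 microseconds of fixed overhead,
    # and a true latency of 20 cycles per dependent instruction.
    clock_hz = 1.0e9
    t_short = 5.0e-6 + 1000 * 20 / clock_hz
    t_long = 5.0e-6 + 2000 * 20 / clock_hz
    print(cycles_per_instruction(t_short, t_long, 1000, 2000, clock_hz))
```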
A few of those things will of course make the majority, the vast vast
majority, of students trying something on a GPU fail completely.
Because they fail, they don't continue and don't get back from
those GPUs faster running code that gives them
something very important: faster calculation speed for whatever they
wanted to run.
This is where AMD and Nvidia, and I politely call it incompetence,
give the rich nations no advantage
over the 3rd world nations, as the students need to be completely
busy full-time to obtain knowledge of the internal workings
of the GPUs in order to get something running fast on them. The majority
will therefore of course fail, which has simply prevented
GPUs from getting massively adopted.
I've seen so many students try and fail at gpu programming,
It's bizarre. The failure rate is so huge. Even a big success doesn't get
recognized as a big success,
simply because the guy didn't know about a few bottlenecks in GPU
programming, as no manual told him
the combination of problems he ran into, as there was no technical
It is true GPUs can be fast, but I feel there is a big need for
better technical documentation of them.
We can no longer ignore this now that 3rd world nations are
overrunning 1st world nations, mainly
because the sneaky organisations that do know everything are of
course bigger over there than here, by virtue of
population size. This is where the huge advantage of the rich nations,
namely that every student has such a GPU
at home, is not being exploited, as the hurdle to GPU
programming is too high for lack of
accurate documentation. Of course in 3rd world nations they have at
most a mobile phone, and very very seldom a laptop (except for the
rich elite), let alone a computer with a capable programmable GPU,
which makes it impossible for the majority
of students in 3rd world nations to do any GPU computation because of a
shortage of cash.
> PS: Full disclosure, I work at Nvidia on CUDA ( CUDA Fortran,
> applications porting with CUDA, MPI+CUDA).
> 2011/4/4 "C. Bergström" <cbergstrom at pathscale.com>:
>> Herbert Fruchtl wrote:
>>> They hear great success stories (which in reality are often
>>> implementations that do one carefully chosen benchmark well),
>>> then look at the
>>> API, look at their existing code, and postpone the start of their
>>> project until
>>> they have six months spare time for it. And we know when that is.
>>> The current approach with more or less vendor specific libraries
>>> (be they "open"
>>> or not) limits the uptake of GPU computing to a few hardcore
>>> developers of
>>> experimental codes who don't mind rewriting their code every two
>>> years. It won't
>>> become mainstream until we have a compiler that turns standard
>>> Fortran (or C++,
>>> if it has to be) into GPU code. Anything that requires more
>>> change than let's
>>> say OpenMP directives is doomed, and rightly so.
>> Hi Herbert,
>> I think your perspective pretty much nails it
>> (shameless self promotion)
>> http://www.pathscale.com/ENZO (PathScale HMPP - native codegen)
>> http://www.caps-entreprise.com/hmpp.html (CAPS HMPP - source to
>> This is really only the tip of the problem and there must also be
>> solutions for scaling *efficiently* across the cluster. (No MPI +
>> or even HMPP is *not* the answer imho.)
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>> To change your subscription (digest mode or unsubscribe) visit