[Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters

Vincent Diepeveen diep at xs4all.nl
Thu Nov 20 09:41:42 PST 2008

On Nov 20, 2008, at 5:39 PM, Jan Heichler wrote:

> Hello Mark,
> Thursday, 20 November 2008, you wrote:
> >> [shameless plug]
> >> A project I have spent some time with is showing 117x on a 3-GPU machine over
> >> a single core of a host machine (3.0 GHz Opteron 2222).  The code is
> >> mpihmmer, and the GPU version of it.  See http://www.mpihmmer.org for more
> >> details.  Ping me offline if you need more info.
> >> [/shameless plug]
> MH> I'm happy for you, but to me, you're stacking the deck by comparing to a
> MH> quite old CPU.  you could break out the prices directly, but comparing 3x
> MH> GPU (modern?  sounds like pci-express at least) to a current entry-level
> MH> cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) be more appropriate.
> Instead of benchmarking some CPU vs. some GPU wouldn't it be fairer to
> a) compare systems of similar costs (1k, 2k, 3k EUR/USD)
> b) compare systems with a similar power footprint
> ?
> What does it help that 3 GPUs are 1000x faster than an Asus Eee PC?



The correct comparison is power usage, as that is what is 'hot' these
days. Comparing plain cash cost alone is not enough. Weird yet true. In
third-world nations such as China and India, power is not a concern at
all, not even for government-related tasks.
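
A minimal sketch of the comparison argued for above: normalize the same
measured throughput by power draw and by purchase price rather than
quoting a raw speedup. All names and numbers here are invented
placeholders, not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class System:
    """One machine in the comparison. All fields are hypothetical inputs."""
    name: str
    throughput: float  # work units per second on the SAME workload
    watts: float       # power draw under load
    cost: float        # purchase price, in one common currency


def per_watt(system: System) -> float:
    """Throughput delivered per watt of power draw."""
    return system.throughput / system.watts


def per_cost(system: System) -> float:
    """Throughput delivered per unit of purchase price."""
    return system.throughput / system.cost


def advantage(a: System, b: System, metric) -> float:
    """How many times better system a is than system b under a metric."""
    return metric(a) / metric(b)


# Placeholder figures, chosen only to make the arithmetic easy to follow.
gpu_node = System("3-GPU node", throughput=100.0, watts=800.0, cost=3000.0)
cpu_node = System("8-core node", throughput=20.0, watts=400.0, cost=2000.0)

raw_speedup = gpu_node.throughput / cpu_node.throughput
speedup_per_watt = advantage(gpu_node, cpu_node, per_watt)
speedup_per_cost = advantage(gpu_node, cpu_node, per_cost)
```

With these made-up inputs the raw 5x speedup shrinks to 2.5x per watt,
which is the kind of gap the power-based comparison is meant to expose.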

The slow adoption of manycores, even for workloads that would (in
theory at least) do well on them, is definitely held back by
portability.

I had some ESA guy on the phone a few days ago, and I heard the word
"portability" a bit too often. That is why they do far too much in
ugly, slow Java code. Not fast enough on 1 PC? Put another 100 there.

I was given exactly the same reasoning (the portability problem) for
other projects where I tried to sneak in GPU computing, regardless of
the manufacturer. Portability was the KILLER there as well.

If you write bureaucratic paper documents, then CUDA is not portable
and never will be, of course, as the hardware is simply different from
a CPU.

Yet that code must be portable between old Sun and UNIX-type machines,
modern quadcores, and new GPU hardware, in case you want to introduce
GPUs. Not realistic, of course.

Just enjoy the speedup, I'd say, if you can get it.

They can spend millions on hardware, but not even a couple of hundred
thousand on customized software that solves the portability problem
with a plugin that does the crunching just on GPUs.

Idiotic, yet that's the plain truth.

So to speak, manycores will only make it in there once NASA publishes
a big article online bragging about how fast their supercomputing
codes run on today's GPUs, of which they own 100k for number
crunching.

I would argue that for workloads favourable to GPUs, of which there
are only a very few as of now, NVIDIA/AMD is up to 10x faster than a
quadcore, if you know how to get it out of the card.

So for now GPGPU is probably the cheap alternative for a few very
specific tasks in third-world nations.

May they lead us in the path ahead...

In itself it is very funny that bureaucratic reasons (portability) are
the biggest problem limiting progress.

When you speak to hardware designers about, say, 32-core CPUs, they
laugh out loud. For now, the only scalable hardware that gives a big
punch in a single processor seems to be manycores.

All those managers have simply put their minds in a big storage bunker
where alternatives are not allowed in. Even an economic crisis will
not change that. They have to be bombarded with actual products that
are interesting to them and that get a huge speedup on GPUs before
they start to understand the advantage.

The few who do understand already all keep their work secret, and
usually it is guys who are not exactly very good at parallelization
who get to "try out" the GPU in question. That's another recipe for
disaster, of course.

Logically, they never even get a speedup over a simple quadcore. If
you compare assembler-level SSE2 (modified Intel primitives in SSE2,
if you will) against a clumsy guy (not clumsy in his own opinion) who
tries out the GPU for a few weeks, obviously it is going to fail.

Something that has been algorithmically optimized for 20-30 years for
PC-type hardware suddenly has to be ported to a GPU within a few
weeks. There are not many who can do that.

You need a completely different algorithmic approach for that.
Something that is memory bound CAN be rewritten to be CPU bound,
sometimes even without losing speed. Simply because they never had the
luxury of such huge crunching power, they never tried!
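
One small, generic illustration of trading memory accesses for
arithmetic is bit counting: the first function below leans on a
precomputed table (memory lookups), the second gets the same answer
with arithmetic only. This is a sketch chosen for illustration, not a
routine taken from any code discussed in this thread; the function and
constant names are made up here.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

# Precomputed number of set bits for every possible byte value.
POPCOUNT_TABLE = [bin(i).count("1") for i in range(256)]


def popcount_table(x: int) -> int:
    """Memory-oriented version: eight table lookups for a 64-bit value."""
    total = 0
    for shift in range(0, 64, 8):
        total += POPCOUNT_TABLE[(x >> shift) & 0xFF]
    return total


def popcount_arith(x: int) -> int:
    """Compute-oriented version: the same count with no table at all.

    Bits are summed in parallel inside the word: first in pairs, then
    in nibbles, then in bytes. The final multiply adds all byte sums
    into the top byte.
    """
    x &= MASK64
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return ((x * 0x0101010101010101) & MASK64) >> 56
```

Both functions return the same result for any 64-bit input. Which one
is faster depends on the hardware: the table version is limited by
memory access, the arithmetic version by how much raw crunching power
is available.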

But that 20-year head start in optimization is a big handicap for
GPUs.

Add to that the fact that Intel is used to GIVING AWAY hardware to
developers. I have yet to see NVIDIA do that.

If the same guys who failed above had that hardware at home for years,
they MIGHT get some ideas and tell their boss.

It's the current reports from those guys that add to the
storage-bunker thinking.

It is wrong to assume that experts can predict the future.

