[Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters

Joe Landman landman at scalableinformatics.com
Thu Nov 20 11:55:04 PST 2008


Mark Hahn wrote:
>> [shameless plug]
>>
>> A project I have spent some time with is showing 117x on a 3-GPU 
>> machine over a single core of a host machine (3.0 GHz Opteron 2222).  
>> The code is mpihmmer, and the GPU version of it.  See 
>> http://www.mpihmmer.org for more details.  Ping me offline if you need 
>> more info.
>>
>> [/shameless plug]
> 
> I'm happy for you, but to me, you're stacking the deck by comparing to a 
> quite old CPU.  you could break out the prices directly, but comparing 3x

Hmmm... This is the machine the units were hosted in.  The 2222 is not 
"quite old" by my definition of old.  My experience with this code on 
Barcelona has been that it hasn't added much performance.  I'll quantify 
this more for you in the future.

> GPU (modern?  sounds like pci-express at least) to a current entry-level 
> cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) would be more appropriate.

Hey ... messenger ... don't shoot? :)

We would love to have a Shanghai.  I don't have one in the lab.  I just 
asked AMD for one.  I honestly don't expect it to make much of a difference.

> at the VERY least, honesty requires comparing one GPU against all the cores
> in a current CPU chip.  with your numbers, I expect that would change 

We are not being dishonest; in fact, I was responding to the "can't 
really get good performance" thread.  You can.  This code scales 
linearly with the number of cores, and our MPI version scales linearly 
across compute nodes.
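
To make the scaling claim concrete: the decomposition is embarrassingly 
parallel, so each rank scores its own slice of the sequence database and 
only the results get reduced at the end.  The sketch below is not 
mpihmmer's actual code (see the site and the papers for that); it is 
just the general pattern, with score_sequence() and NUM_SEQS standing in 
for the real scoring routine and database size:

/*
 * Sketch only: the usual embarrassingly-parallel pattern for scanning a
 * sequence database with MPI.  This is NOT mpihmmer's implementation;
 * score_sequence() and NUM_SEQS are placeholders.
 *
 * Build/run (typical):  mpicc -std=c99 -O2 scan.c -o scan && mpirun -np 8 ./scan
 */
#include <mpi.h>
#include <stdio.h>

#define NUM_SEQS 1000000L                 /* placeholder database size */

static double score_sequence(long i)      /* stand-in for the real scorer */
{
    return (double)(i % 97);
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* static block decomposition of the database across ranks */
    long begin = rank * NUM_SEQS / nprocs;
    long end   = (rank + 1) * NUM_SEQS / nprocs;

    double local_best = -1.0, global_best;
    for (long i = begin; i < end; i++) {
        double s = score_sequence(i);
        if (s > local_best)
            local_best = s;
    }

    /* the only communication: reduce the best score (a real code would
     * gather the full hit lists instead) */
    MPI_Reduce(&local_best, &global_best, 1, MPI_DOUBLE, MPI_MAX,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("best score over %d ranks: %g\n", nprocs, global_best);

    MPI_Finalize();
    return 0;
}

Because the sequences are independent, the only communication is that 
final reduction, which is why the scaling stays close to linear both 
across cores and across nodes.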

> the speedup from 117 to around 15.  still very respectable.

Look, the performance is good.  (For reference, 117x over a single core 
works out to roughly 117/8, i.e. about 15x, against a fully used 
eight-core node, given that the host code scales linearly across cores.) 
The cost to get this performance is very low on the acquisition side, 
and the effort to get it is, relatively speaking, quite low.  I want to 
emphasize this.

It won't work for every code.  There are large swaths of code it won't 
work for.  This is life, and as with all technologies, YMMV.

> I apologize for not RTFcode, but does the host version of hmmer you're 
> comparing with vectorize using SSE?

JP did the SSE vectorization; performance was about 60% better than the 
baseline.  I (and Joydeep at AMD) rewrote 30 lines of code and got a 2x 
speedup.  There are papers referenced on the website that talk about this.
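
For the flavor of what that kind of rewrite looks like, here is a 
made-up example (it is not the hmmer kernel; the papers on the site 
describe the actual change): a scalar add-then-max loop and its SSE 
equivalent, four floats per instruction.

/*
 * Illustration only: vectorizing a simple add-then-max recurrence with
 * SSE intrinsics.  This is NOT the hmmer/mpihmmer kernel; it just shows
 * the style of rewrite being discussed.  Assumes n is a multiple of 4.
 *
 * Build (typical):  gcc -std=c99 -O2 -msse addmax.c -o addmax
 */
#include <xmmintrin.h>    /* SSE intrinsics */
#include <stdio.h>

void add_max_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++) {
        float s = a[i] + b[i];
        if (s > out[i])
            out[i] = s;
    }
}

void add_max_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {          /* 4 floats per 128-bit register */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vo = _mm_loadu_ps(out + i);
        __m128 vs = _mm_add_ps(va, vb);       /* a[i] + b[i]      */
        vo = _mm_max_ps(vo, vs);              /* elementwise max  */
        _mm_storeu_ps(out + i, vo);
    }
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float o1[8] = {0}, o2[8] = {0};

    add_max_scalar(a, b, o1, 8);
    add_max_sse(a, b, o2, 8);
    for (int i = 0; i < 8; i++)
        printf("%g %g\n", o1[i], o2[i]);      /* the two columns should agree */
    return 0;
}

The point is just that a small, localized change in the right inner loop 
can be worth 2x.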

> 
>>> or more generally: fairly small data, accessed data-parallel or with 
>>> very regular and limited sharing, with high work-per-data.
>>
>> ... not small data.  You can stream data.
> 
> can you sustain your 117x speedup if your data is in host memory?

I believe that the databases are being streamed from host RAM and disk.

> by small, I meant the on-gpu-card memory, mainly, which is fast but 
> often more limited than host memory.

The database sizes are 3-4 GB and growing rapidly.  The tests were 
originally run on GTX260s, which have 1 GB of RAM or less.
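
Since the card holds much less than the database, the data has to be 
pushed through in pieces.  A minimal sketch of the usual approach (not 
the actual mpihmmer GPU code; the chunk size, stream count and kernel 
are all placeholders) is to double-buffer chunks through pinned host 
memory and Cuda streams so the copies overlap with the scoring:

/*
 * Sketch: streaming a database that is larger than GPU memory.
 * Chunks are copied from pinned host buffers and scored on the device,
 * double-buffered across two CUDA streams so copy and compute overlap.
 * score_chunk(), CHUNK_BYTES and nchunks are placeholders.
 *
 * Build (typical):  nvcc -O2 stream.cu -o stream
 */
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK_BYTES (64UL * 1024 * 1024)   /* 64 MB slice of the database */

__global__ void score_chunk(const char *chunk, size_t bytes, float *scores)
{
    /* placeholder kernel; the real scoring recurrence would go here */
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < 1024 && i < bytes)
        scores[i] = (float)chunk[i];
}

int main(void)
{
    char *h_buf[2], *d_buf[2];
    float *d_scores;
    cudaStream_t stream[2];

    for (int b = 0; b < 2; b++) {
        cudaMallocHost((void **)&h_buf[b], CHUNK_BYTES);  /* pinned host RAM */
        cudaMalloc((void **)&d_buf[b], CHUNK_BYTES);
        cudaStreamCreate(&stream[b]);
    }
    cudaMalloc((void **)&d_scores, 1024 * sizeof(float));

    const int nchunks = 8;                 /* stand-in for db_size / CHUNK_BYTES */
    for (int c = 0; c < nchunks; c++) {
        int b = c & 1;                     /* ping-pong between the two buffers  */
        cudaStreamSynchronize(stream[b]);  /* wait until this buffer is free     */
        /* a real code would read the next slice of the database into h_buf[b] */
        cudaMemcpyAsync(d_buf[b], h_buf[b], CHUNK_BYTES,
                        cudaMemcpyHostToDevice, stream[b]);
        score_chunk<<<4096, 256, 0, stream[b]>>>(d_buf[b], CHUNK_BYTES, d_scores);
    }

    for (int b = 0; b < 2; b++) {
        cudaStreamSynchronize(stream[b]);
        cudaFreeHost(h_buf[b]);
        cudaFree(d_buf[b]);
        cudaStreamDestroy(stream[b]);
    }
    cudaFree(d_scores);
    printf("streamed %d chunks\n", nchunks);
    return 0;
}

With two streams the copy of chunk N+1 can overlap the scoring of chunk 
N, so the card's 1 GB (or less) bounds the chunk size, not the database 
size; the practical limits become the PCIe link and the kernel itself.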

> sidebar: it's interesting that ram is incredibly cheap these days,
> and we typically spec a middle-of-the-road machine at 2GB/core.
> but even 4GB/core is not much more expensive, but to be honest,
> the number of users who need that much is fairly small.
> 
>>> GP-GPU tools are currently immature, and IMO the hardware probably 
>>> needs a generation of generalization before it becomes really widely 
>>> used.
>>
>> Hrmm...  Cuda is pretty good.  Still needs some polish, but people can 
>> use it, and are generating real apps from it.  We are seeing pretty 
>> wide use ... I guess the issue is what one defines as "wide".
> 
> Cuda is NV-only, and forces the programmer to face a lot of limits and 
> weaknesses.  at least I'm told so by our Cuda users - things like having 

Er ... ok.  Cuda is getting pretty much all the mind-share.  We have 
asked AMD to support it.  AMD is doing something else: CTM was not 
successful, and I haven't heard what the new strategy is.  OpenCL looks 
like it will be "designed by committee".

> to re-jigger code to avoid running out of registers.  from my perspective,
> a random science prof is going to be fairly put off by that sort of thing
> unless the workload is really impossible to do otherwise.  (compared to 

This is not my experience.

> the traditional cluster+MPI approach, which is portable, scalable and at 
> least short-term future-proof.)

If you go to the site, you will discover that mpihmmer is in fact 
cluster+MPI.  It was extended to include GPU, FPGA, ... .

Again, please don't shoot the messenger.

> 
> thanks, mark.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


