[Beowulf] NVIDIA GPUs, CUDA, MD5, and "hobbyists"

Craig Tierney Craig.Tierney at noaa.gov
Fri Jun 20 07:13:52 PDT 2008


Kilian CAVALOTTI wrote:
> On Thursday 19 June 2008 04:32:11 pm Chris Samuel wrote:
>> ----- "Kilian CAVALOTTI" <kilian at stanford.edu> wrote:
>>> AFAIK, the multi-GPU Tesla boxes contain up to 4 Tesla processors,
>>> but are hooked to the controlling server with only 1 PCIe link,
>>> right? Does this spell "bottleneck" to anyone?
>> The nVidia website says:
>>
>> http://www.nvidia.com/object/tesla_tech_specs.html
>>
>> # 6 GB of system memory (1.5 GB dedicated memory per GPU)
> 
> The latest S1070 has even more than that: 4GB per GPU, it seems, 
> according to [1].
> 
> But I think this refers to the "global memory", as described in [2] 
> (slide 12, "Kernel Memory Access"). It's the graphics card's main memory, 
> the kind that is used to store textures in games, for instance. 
> Each GPU core also has what they call "shared memory", which is 
> really only shared between threads on the same core (it's more like an 
> L2 cache, actually).
> 
>> So my guess is that you'd be using local RAM not the
>> host systems RAM whilst computing.
> 
> Right, but at some point, you do need to transfer data from the host 
> memory to the GPU memory, and back. That's where there's probably a 
> bottleneck if all 4 GPUs want to read/dump data from/to the host at the 
> same time.
> 
> Moreover, I don't think that the different GPUs can work together, i.e. 
> exchange data and participate in the same parallel computation. Unless 
> they release something along the lines of a CUDA-MPI, those 4 GPUs 
> sitting in the box would have to be considered as independent 
> processing units. So as I understand it, the scaling benefit from your 
> application's parallelization would be limited to one GPU, no matter 
> how many you have hooked to your machine.
> 

You can integrate MPI with CUDA and create parallel applications.  The CUDA
compiler (nvcc) is essentially a preprocessor that hands the host-side code
to the local C compiler (gcc on Linux by default), so it links against MPI
like any other C code.  I have seen messages on the MVAPICH mailing list
from users doing exactly this.

Since the memory is on the card, you have to transfer data back to the host
before you can send it via an MPI call.  However, if your entire model
can fit in the GPU's memory (which is why the 4GB-per-GPU S1070 Tesla is
useful), then you should be able to pull down just the portion of GPU
memory you want to send out, and then send it.
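
Something like the sketch below is what I have in mind.  It is untested,
and the kernel, buffer names, and sizes are made up; it just assumes one
MPI rank per GPU and shows the device-to-host copy sitting between the
kernel and the MPI call (compile the .cu with nvcc and link against your
MPI library; exact flags depend on the install):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)     /* elements per rank -- arbitrary */

/* trivial placeholder kernel: scale a vector that lives in GPU memory */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(int argc, char **argv)
{
    int rank, nprocs, ndev = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* assumption: one MPI rank per GPU, mapped by rank number */
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(rank % ndev);

    float *h_buf = (float *) malloc(N * sizeof(float));
    float *d_buf = NULL;
    for (int i = 0; i < N; i++)
        h_buf[i] = (float) rank;

    cudaMalloc((void **) &d_buf, N * sizeof(float));
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(N + 255) / 256, 256>>>(d_buf, 2.0f, N);

    /* GPU memory is not visible to MPI, so stage the piece we want to
       send back through host memory first ...                          */
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

    /* ... then it is an ordinary host buffer as far as MPI is concerned */
    float local = h_buf[0], total = 0.0f;
    MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of first elements across %d ranks: %f\n", nprocs, total);

    cudaFree(d_buf);
    free(h_buf);
    MPI_Finalize();
    return 0;
}

The cudaMemcpy over PCIe is the step I would expect to dominate, which is
where the shared-link question above comes back in.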

Or at least that's how I understand it.  When I get my systems, I will
get to figure out the "real" details.

Craig





> I don't even know how you choose (or even if you can choose) which 
> GPU your code gets executed on. It has to be handled by the 
> driver on the host machine somehow.
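
For what it's worth, my understanding is that the CUDA runtime does let the
host process pick a device explicitly.  A rough, untested sketch -- just the
runtime calls, nothing S1070-specific:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);                /* GPUs the driver exposes    */
    for (int i = 0; i < ndev; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s\n", i, prop.name);
    }
    if (ndev > 0)
        cudaSetDevice(ndev - 1);              /* explicitly bind to one GPU */
    return 0;
}

So with one process per GPU you would presumably just map MPI rank to device
number, as in the earlier sketch.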
> 
>> There's a lot of fans there..
> 
> They probably get hot. At least the G80-based cards do. They say "Typical Power 
> Consumption: 700W" for the 4-GPU box. Given that a modern gaming rig 
> featuring a pair of 8800GTX in SLI already requires a 1kW PSU, I would 
> put this on the optimistic side.
> 
> [1]http://www.nvidia.com/object/tesla_s1070.html
> [2]http://www.mathematik.uni-dortmund.de/~goeddeke/arcs2008/C1_CUDA.pdf
> 
> 
> Cheers,


-- 
Craig Tierney (craig.tierney at noaa.gov)


