[Beowulf] NVIDIA GPUs, CUDA, MD5, and "hobbyists"
John Leidel
john.leidel at gmail.com
Fri Jun 20 07:42:09 PDT 2008
Craig [et al.], this is also how I understand it. One could realistically
wrap standard MPI calls to do this for you:
int MPI_GPU_Bcast(void *dev_buf, size_t bytes, int root, MPI_Comm comm)
{
    void *host_buf = malloc(bytes);              /* stage through host memory    */
    cudaMemcpy(host_buf, dev_buf, bytes,
               cudaMemcpyDeviceToHost);          /* pull the data off the GPU    */
    int rc = MPI_Bcast(host_buf, (int)bytes, MPI_BYTE, root, comm);
    cudaMemcpy(dev_buf, host_buf, bytes,
               cudaMemcpyHostToDevice);          /* push it back down on receivers */
    free(host_buf);
    return rc;
}
...just a random thought though...
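A call site would then look just like a normal broadcast, only with a device
pointer. Purely for illustration (d_field and n are made-up names for a float
array already sitting on the card and its length):

MPI_GPU_Bcast(d_field, n * sizeof(float), 0, MPI_COMM_WORLD);

Using a pinned host buffer (cudaMallocHost) instead of plain malloc would
probably help the PCIe transfer rate, but you get the idea.
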
On Fri, 2008-06-20 at 08:13 -0600, Craig Tierney wrote:
> Kilian CAVALOTTI wrote:
> > On Thursday 19 June 2008 04:32:11 pm Chris Samuel wrote:
> >> ----- "Kilian CAVALOTTI" <kilian at stanford.edu> wrote:
> >>> AFAIK, the multi-GPU Tesla boxes contain up to 4 Tesla processors,
> >>> but are hooked to the controlling server with only 1 PCIe link,
> >>> right? Does this spell "bottleneck" to anyone?
> >> The nVidia website says:
> >>
> >> http://www.nvidia.com/object/tesla_tech_specs.html
> >>
> >> # 6 GB of system memory (1.5 GB dedicated memory per GPU)
> >
> > The latest S1070 has even more than that: 4 GB per GPU, it seems,
> > according to [1].
> >
> > But I think this refers to the "global memory", as described in [2]
> > (slide 12, "Kernel Memory Access"). It's the graphics card's main memory,
> > the kind that is used to store textures in games, for instance.
> > Each GPU multiprocessor also has what they call "shared memory", which is
> > really only shared between threads running on the same multiprocessor
> > (it's more like an L2 cache, actually).
> >
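For what it's worth, the "shared memory" Kilian mentions is the small on-chip
scratchpad you get with the __shared__ qualifier inside a kernel, and it is
only visible to the threads of a single block. A rough, untested sketch
(assuming 256-thread, power-of-two-sized blocks):

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];             /* on-chip, per-block scratch space  */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       /* only threads in this block share/sync */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];          /* one partial sum per block */
}

Anything bigger than that per-block scratchpad has to live in the card's
global memory.
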
> >> So my guess is that you'd be using local RAM, not the
> >> host system's RAM, whilst computing.
> >
> > Right, but at some point, you do need to transfer data from the host
> > memory to the GPU memory, and back. That's where there's probably a
> > bottleneck if all 4 GPUs want to read/dump data from/to the host at the
> > same time.
> >
> > Moreover, I don't think that the different GPUs can work together, i.e.
> > exchange data and participate in the same parallel computation. Unless
> > they release something along the lines of a CUDA-MPI, those 4 GPUs
> > sitting in the box would have to be considered independent
> > processing units. So as I understand it, the scaling benefits from your
> > application's parallelization would be limited to one GPU, no matter
> > how many you have hooked to your machine.
> >
>
> You can integrate MPI with CUDA and create parallel applications. The CUDA
> compiler (nvcc) is essentially a preprocessor that hands the host-side code
> to the local C compiler (gcc on Linux by default). I have seen some messages
> on the MVAPICH mailing list about users doing this.
>
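In practice the kernels live in a .cu file that nvcc compiles to an ordinary
object file; the MPI side is built with mpicc as usual and the two get linked
together against libcudart. A rough, untested sketch of the glue (names made
up):

/* kernel.cu -- compiled by nvcc, linked into an ordinary MPI program */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

extern "C" void gpu_scale(float *d_x, float a, int n)
{
    scale<<<(n + 255) / 256, 256>>>(d_x, a, n);   /* launch on the card      */
    cudaThreadSynchronize();                      /* wait for it to complete */
}

The MPI ranks then just call gpu_scale() like any other C function.
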
> Since the memory is on the card, you have to transfer data back to the host
> before you can send it via an MPI call. However, if your entire model
> can fit in the GPU's memory (which is why the 4 GB S1070 Tesla card is
> useful), then you should be able to pull down just the portion of memory
> from the GPU you want to send out, then send it.
>
> Or at least that's how I understand it. When I get my systems I will
> get to figure out the "real" details.
>
> Craig
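
Pulling just the piece you need off the card before the send should look
roughly like this (untested; d_field, offset, count, dest and tag are made-up
names):

float *host_buf = malloc(count * sizeof(float));
/* copy only the sub-range we intend to send, not the whole array */
cudaMemcpy(host_buf, d_field + offset, count * sizeof(float),
           cudaMemcpyDeviceToHost);
MPI_Send(host_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
free(host_buf);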
>
>
>
>
>
> > I don't even know how you choose (or even if you can choose) on which
> > GPU you want your code to be executed. It has to be handled by the
> > driver on the host machine somehow.
> >
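You can choose, at least with the CUDA runtime: cudaGetDeviceCount() reports
how many cards the driver sees, and cudaSetDevice() picks one, as long as it
is called before any allocations or kernel launches in that process. An
untested sketch that simply maps MPI ranks to GPUs (assuming one rank per GPU
on the node):

int rank, ndev;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
cudaGetDeviceCount(&ndev);          /* number of CUDA-capable devices found */
cudaSetDevice(rank % ndev);         /* crude rank-to-GPU mapping            */
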
> >> There are a lot of fans there...
> >
> > They probably get hot. At least the G80s do. They say "Typical Power
> > Consumption: 700W" for the 4-GPU box. Given that a modern gaming rig
> > featuring a pair of 8800GTXs in SLI already requires a 1 kW PSU, I would
> > put this on the optimistic side.
> >
> > [1] http://www.nvidia.com/object/tesla_s1070.html
> > [2] http://www.mathematik.uni-dortmund.de/~goeddeke/arcs2008/C1_CUDA.pdf
> >
> >
> > Cheers,
>
>