[Beowulf] NVIDIA GPUs, CUDA, MD5, and "hobbyists"
John Leidel
john.leidel at gmail.com
Fri Jun 20 07:42:09 PDT 2008
Craig [et al.], this is also how I understand it. One could realistically
wrap standard MPI calls to do this for you:
int MPI_GPU_Bcast(void *dev_buf, size_t bytes, int root, MPI_Comm comm)
{
    void *host_buf = malloc(bytes);              /* stage through host memory    */
    cudaMemcpy(host_buf, dev_buf, bytes,
               cudaMemcpyDeviceToHost);          /* pull the data off the GPU    */
    int rc = MPI_Bcast(host_buf, (int)bytes, MPI_BYTE, root, comm);
    cudaMemcpy(dev_buf, host_buf, bytes,
               cudaMemcpyHostToDevice);          /* push it back down on receivers */
    free(host_buf);
    return rc;
}
...just a random thought though...
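A call site would then look just like a normal broadcast, only with a device
pointer. Purely for illustration (d_field and n are made-up names for a float
array already sitting on the card and its length):

MPI_GPU_Bcast(d_field, n * sizeof(float), 0, MPI_COMM_WORLD);

Using a pinned host buffer (cudaMallocHost) instead of plain malloc would
probably help the PCIe transfer rate, but you get the idea.
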
On Fri, 2008-06-20 at 08:13 -0600, Craig Tierney wrote:
> Kilian CAVALOTTI wrote:
> > On Thursday 19 June 2008 04:32:11 pm Chris Samuel wrote:
> >> ----- "Kilian CAVALOTTI" <kilian at stanford.edu> wrote:
> >>> AFAIK, the multi-GPU Tesla boxes contain up to 4 Tesla processors,
> >>> but are hooked to the controlling server with only 1 PCIe link,
> >>> right? Does this spell "bottleneck" to anyone?
> >> The nVidia website says:
> >>
> >> http://www.nvidia.com/object/tesla_tech_specs.html
> >>
> >> # 6 GB of system memory (1.5 GB dedicated memory per GPU)
> >
> > The latest S1070 has even more than that: 4 GB per GPU, it seems,
> > according to [1].
> >
> > But I think this refers to the "global memory", as described in [2]
> > (slide 12, "Kernel Memory Access"). It's the graphics card's main memory,
> > the kind that is used to store textures in games, for instance.
> > Each GPU multiprocessor also has what they call "shared memory", which is
> > really only shared between threads running on the same multiprocessor
> > (it's more like an L2 cache, actually).
> >
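For what it's worth, the "shared memory" Kilian mentions is the small on-chip
scratchpad you get with the __shared__ qualifier inside a kernel, and it is
only visible to the threads of a single block. A rough, untested sketch
(assuming 256-thread, power-of-two-sized blocks):

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];             /* on-chip, per-block scratch space  */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       /* only threads in this block share/sync */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];          /* one partial sum per block */
}

Anything bigger than that per-block scratchpad has to live in the card's
global memory.
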
> >> So my guess is that you'd be using local RAM, not the
> >> host system's RAM, whilst computing.
> >
> > Right, but at some point, you do need to transfer data from the host
> > memory to the GPU memory, and back. That's where there's probably a
> > bottleneck if all 4 GPUs want to read/dump data from/to the host at the
> > same time.
> >
> > Moreover, I don't think that the different GPUs can work together, i.e.
> > exchange data and participate in the same parallel computation. Unless
> > they release something along the lines of a CUDA-MPI, those 4 GPUs
> > sitting in the box would have to be considered independent
> > processing units. So as I understand it, the scaling benefits from your
> > application's parallelization would be limited to one GPU, no matter
> > how many you have hooked to your machine.
> >
>
> You can integrate MPI with CUDA and create parallel applications. The CUDA
> compiler (nvcc) is essentially a preprocessor that hands the host-side code
> to the local C compiler (gcc on Linux by default). I have seen some messages
> on the MVAPICH mailing list about users doing this.
>
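In practice the kernels live in a .cu file that nvcc compiles to an ordinary
object file; the MPI side is built with mpicc as usual and the two get linked
together against libcudart. A rough, untested sketch of the glue (names made
up):

/* kernel.cu -- compiled by nvcc, linked into an ordinary MPI program */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

extern "C" void gpu_scale(float *d_x, float a, int n)
{
    scale<<<(n + 255) / 256, 256>>>(d_x, a, n);   /* launch on the card      */
    cudaThreadSynchronize();                      /* wait for it to complete */
}

The MPI ranks then just call gpu_scale() like any other C function.
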
> Since the memory is on the card, you have to transfer data back to the host
> before you can send it via an MPI call. However, if your entire model
> can fit in the GPU's memory (which is why the 4 GB S1070 Tesla card is
> useful), then you should be able to pull down just the portion of memory
> from the GPU you want to send out, then send it.
>
> Or at least that's how I understand it. When I get my systems I will
> get to figure out the "real" details.
>
> Craig
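
Pulling just the piece you need off the card before the send should look
roughly like this (untested; d_field, offset, count, dest and tag are made-up
names):

float *host_buf = malloc(count * sizeof(float));
/* copy only the sub-range we intend to send, not the whole array */
cudaMemcpy(host_buf, d_field + offset, count * sizeof(float),
           cudaMemcpyDeviceToHost);
MPI_Send(host_buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
free(host_buf);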
>
>
>
>
>
> > I don't even know how you choose (or even if you can choose) on which
> > GPU you want your code to be executed. It has to be handled by the
> > driver on the host machine somehow.
> >
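You can choose, at least with the CUDA runtime: cudaGetDeviceCount() reports
how many cards the driver sees, and cudaSetDevice() picks one, as long as it
is called before any allocations or kernel launches in that process. An
untested sketch that simply maps MPI ranks to GPUs (assuming one rank per GPU
on the node):

int rank, ndev;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
cudaGetDeviceCount(&ndev);          /* number of CUDA-capable devices found */
cudaSetDevice(rank % ndev);         /* crude rank-to-GPU mapping            */
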
> >> There are a lot of fans there...
> >
> > They probably get hot. At least the G80s do. They say "Typical Power
> > Consumption: 700W" for the 4-GPU box. Given that a modern gaming rig
> > featuring a pair of 8800GTXs in SLI already requires a 1 kW PSU, I would
> > put this on the optimistic side.
> >
> > [1] http://www.nvidia.com/object/tesla_s1070.html
> > [2] http://www.mathematik.uni-dortmund.de/~goeddeke/arcs2008/C1_CUDA.pdf
> >
> >
> > Cheers,
>
>