[Beowulf] GPU Beowulf Clusters

Sat Jan 30 04:31:45 PST 2010

On Thu, 28 Jan 2010 09:38:14 -0800
Jon Forrest <jlforrest at berkeley.edu> wrote:

> I'm about to spend ~$20K on a new cluster
> that will be a proof-of-concept for doing
> GPU-based computing in one of the research
> groups here.
> 
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
> 
> 1) The CPU speed and number of cores are not
> that important because most of the computing will
> be done inside the GPU.
> 

The speed matters less, but the number of cores does matter. You should have at
least one core per GPU, as the CPU is in charge of scheduling and initiating
memory transfers (and, if they are not set up as DMA, of handling the memory
transfers itself).

When latency is an issue (especially for jobs with a lot of CPU-related
scheduling), the CPU polls the GPU for results, which can bump CPU usage.
Nehalem raises another issue: there is no front-side bus to a northbridge, so
memory traffic goes via the CPU.

It is also recommended, BTW, that you have at least as much system memory as
GPU memory, so with Tesla that is 4GB per GPU.
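
As a rough sketch of the two rules of thumb above (one core per GPU, and host
memory at least equal to GPU memory), the minimum host spec per node could be
worked out like this. The function names are made up for illustration; only
the 4GB-per-GPU figure comes from the text above.

```python
# Rules of thumb from the text: at least one CPU core per GPU, and at
# least as much system memory as total GPU memory.
GPU_MEMORY_GB = 4  # per Tesla GPU, as stated above


def min_cores(gpus_per_node):
    """Minimum CPU cores: one per GPU (hypothetical helper)."""
    return gpus_per_node


def min_host_memory_gb(gpus_per_node, gpu_memory_gb=GPU_MEMORY_GB):
    """Minimum system RAM: match the total GPU memory (hypothetical helper)."""
    return gpus_per_node * gpu_memory_gb


for gpus in (1, 2, 4):
    print(f"{gpus} GPU(s): >= {min_cores(gpus)} core(s), "
          f">= {min_host_memory_gb(gpus)} GB RAM")
```

These are lower bounds only; the OS and any CPU-side work need memory on top.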

> 2) Serious GPU boards are large enough that
> they don't easily fit into standard 1U pizza
> boxes. Plus, they require more power than the
> standard power supplies in such boxes can
> provide. I'm not familiar with the boxes
> that therefore should be used in a GPU cluster.
> 

You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U
Tesla S1070 pizza box, which has 4 Tesla GPUs
http://www.nvidia.com/object/product_tesla_s1070_us.html
or one of the several vendors out there that match two Tesla GPUs (usually the
Tesla M1060 in this case, which is a passively cooled version of the C1060)
http://www.nvidia.com/object/product_tesla_m1060_us.html
to a dual-CPU Xeon in a 1U system.
You can start here (the links page from NVIDIA)
http://www.nvidia.com/object/tesla_preconfigured_clusters_wtb.html

There are other specialized options if you want, but most of them are aimed at
higher-budget clusters.

You can push it in terms of power: each Tesla takes 160W, and once you add what
the CPU and the rest of the system require, a 1000W power supply should do.

The S1070 comes with a 1200W power supply on board.
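
To make the power arithmetic concrete, here is a small sketch. Only the 160W
per Tesla and the 1000W supply come from the text above; the CPU and
"rest of the system" wattages are placeholder assumptions, so substitute the
figures for your actual parts.

```python
TESLA_WATTS = 160  # per Tesla GPU, from the text above


def node_power_watts(num_gpus, cpu_watts, other_watts):
    """Estimated node draw: GPUs plus CPUs plus everything else."""
    return num_gpus * TESLA_WATTS + cpu_watts + other_watts


def psu_headroom_watts(psu_watts, num_gpus, cpu_watts, other_watts):
    """Watts left over on the power supply (negative means undersized)."""
    return psu_watts - node_power_watts(num_gpus, cpu_watts, other_watts)


# ASSUMED values, not from the original post:
CPU_WATTS = 2 * 95   # hypothetical dual-socket node, 95W per CPU
OTHER_WATTS = 150    # hypothetical disks, fans, memory, motherboard

draw = node_power_watts(2, CPU_WATTS, OTHER_WATTS)
spare = psu_headroom_watts(1000, 2, CPU_WATTS, OTHER_WATTS)
print(f"2-GPU node: ~{draw}W draw, ~{spare}W headroom on a 1000W supply")
```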

> 3) Ideally, I'd like to put more than one GPU
> card in each computer node, but then I hit the
> issues in #2 even harder.
> 

You are looking for the Tesla S1070 or the previously mentioned solutions.

> 4) Assuming that a GPU can't be "time shared",
> this means that I'll have to set up my batch
> engine to treat the GPU as a non-sharable resource.
> This means that I'll only be able to run as many
> jobs on a compute node as I have GPUs. This also means
> that it would be wasteful to put CPUs in a compute
> node with more cores than the number GPUs in the
> node. (This is assuming that the jobs don't do
> anything parallel on the CPUs - only on the GPUs).
> Even if GPUs can be time shared, given the expense
> of copying between main memory and GPU memory,
> sharing GPUs among several processes will degrade
> performance.
> 

The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time
share is by alternating kernels, as long as there is enough memory for all of
them. That shouldn't be done for HPC (the same goes for CPUs, by the way, due
to NUMA/L2 cache and context-switching issues).

What you would want to do is set up the cards in exclusive mode and then tell
the users not to choose a card explicitly. The context creation function will
then choose the next available card automatically. With the Tesla S1070 you
would then set up the machine as having 4 cores for scheduling.
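
The behaviour described above can be pictured with a toy model. This is NOT
the CUDA API, just an illustration with a made-up class: each card accepts one
job at a time, a job that doesn't ask for a specific card gets the next free
one, and a node with 4 GPUs therefore offers 4 scheduling slots.

```python
class ExclusiveGpuPool:
    """Toy model of GPUs in exclusive mode (hypothetical, not a real API)."""

    def __init__(self, num_gpus):
        # owner[i] is the job holding GPU i, or None if the GPU is free
        self.owner = [None] * num_gpus

    def acquire(self, job):
        """Give the job the next free GPU; return its index, or None."""
        for idx, current in enumerate(self.owner):
            if current is None:
                self.owner[idx] = job
                return idx
        return None  # every GPU is busy: the job has to wait

    def release(self, idx):
        """Mark GPU idx as free again when its job finishes."""
        self.owner[idx] = None


pool = ExclusiveGpuPool(4)  # e.g. one host with an S1070 attached
for job in ("job-a", "job-b", "job-c", "job-d", "job-e"):
    print(job, "->", pool.acquire(job))
```

The fifth job gets no card, which is why the batch system should advertise
exactly as many slots as there are GPUs.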

The processes will share the PCI bus for communication, though, so you may
prefer to set up the system as 1 job per machine, or at least use a round-robin
scheduler.

> Are there any other issues I'm leaving out?
> 

Take note that the S1070 is ~$6K, so you are talking about at most two to three
machines with your budget.
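
The budget arithmetic, as a quick sketch. The ~$20K budget and ~$6K S1070
price are from this thread; the host node price is a placeholder assumption
for illustration only.

```python
BUDGET_USD = 20_000  # from the original question
S1070_USD = 6_000    # approximate price quoted above


def max_units(budget, gpu_unit_cost, host_cost=0):
    """How many GPU unit + host pairs fit in the budget (whole units only)."""
    return budget // (gpu_unit_cost + host_cost)


# ASSUMED host price, not from the original post:
HOST_USD = 2_500

print("S1070 units alone:", max_units(BUDGET_USD, S1070_USD))
print("S1070 + host pairs:", max_units(BUDGET_USD, S1070_USD, HOST_USD))
```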

Also, don't even think about putting that S1070 anywhere but a server room, or
at least nowhere with users nearby, as it makes a lot of noise.

> Cordially,