[Beowulf] GPU-based cluster for machine learning purposes

Mark Hahn hahn at mcmaster.ca
Wed Apr 9 22:44:30 PDT 2014

> I'm considering a proof-of-concept Beowulf cluster build for machine
> learning purposes.

you can't go wrong using cheap/PC/commodity parts.  you'll also get 
the easiest access to tools/distros/etc.

> My main requirements are that it should be based on
> embedded development boards (relatively small power consumption and
> size).

hmm.  I don't think you get any free lunch.  embedded systems will
have relatively less powerful CPUs (possibly not a problem),
but will otherwise spend roughly the same power on the other two
power-consuming components (ram and gpu).

> In short I need as good as possible double precision matrix
> multiplication performance with small power consumption and size.

TK1 appears to be SP-oriented (not surprisingly).  it's a little 
unclear what its power dissipation is - I'd guess something in the 
20W range for linpack.

> Taking matrix multiplication into consideration I thought that GPU is
> natural choice.

well, maybe.  you always save power by operating more units at a lower 
clock, and GPUs tend to embrace this approach.  it's not as if GPUs 
have some magically more efficient circuits otherwise.  but it's 
probably worth looking at the gpu-linpack performance/watt of 
AMD's APU options.  (though those contain a higher-performance CPU and 
memory subsystem than the TK1.)
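fwiw, performance/watt is just measured linpack gflops divided by measured 
wall power.  a trivial comparison helper -- every figure below is a 
placeholder, not a real measurement; substitute your own numbers:

```python
# perf-per-watt comparison helper.  all figures are placeholders --
# plug in your own measured linpack gflops and wall-socket watts.
candidates = {
    "boardA": (10.0, 20.0),   # (DP gflops, watts) -- hypothetical
    "boardB": (60.0, 95.0),   # hypothetical APU-class box
}
for name, (gflops, watts) in sorted(candidates.items()):
    print("%-8s %.2f gflops/W" % (name, gflops / watts))
```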

> If I missed something then please let me know.

just that there's no free lunch.  it's only got one 192-wide vector unit,
and a desktopy SP-tuned one at that, so it's not really in the same category
as high-end tesla stuff.

> curious about your professional opinion on this build.

my professional opinion is that when people use the phrase "build"
as a noun, they're coming from the PC/gamer world ;)


> Questions that already came to my mind:
> 1. What is the most-used diagnostic software for keeping a cluster up and
> running?

what failure modes are you thinking about?  I use IPMI on my clusters,
and wouldn't build a cluster without it.
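for what it's worth, a minimal sketch of polling sensors across nodes with 
ipmitool.  the hostnames and credentials are placeholders, and the sketch 
just prints the commands as a dry run -- swap in subprocess.run to execute:

```python
#!/usr/bin/env python3
# sketch: dump IPMI sensor readings on each node via ipmitool over LAN.
# node names and credentials are placeholders -- adjust for your cluster.
NODES = ["node%02d" % i for i in range(1, 5)]  # hypothetical BMC hostnames

def ipmi_cmd(host, user="admin", password="admin"):
    """Build an ipmitool command line to list sensor readings (SDR)."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "sdr", "list"]

if __name__ == "__main__":
    # dry run: print the commands rather than executing them.
    for node in NODES:
        print(" ".join(ipmi_cmd(node)))
```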

> Is it something that I should incorporate from outside of the
> standard distro (like Debian/Ubuntu) repositories for this kind of build?
> Or are the standard tools enough?

HPC-wise, the only thing wrong with distros is that they tend to omit
vendor-tuned blas/lapack libs.
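to see why the tuned libs matter, compare a naive triple-loop dgemm against 
OpenBLAS/ATLAS/MKL on the same box -- the tuned lib typically wins by one to 
two orders of magnitude.  a pure-stdlib python baseline for the naive side 
(the matrix size is illustrative):

```python
# sketch: naive double-precision matrix multiply, the baseline that a
# tuned BLAS (OpenBLAS/ATLAS/MKL) beats by 10-100x on real hardware.
import random, time

def naive_dgemm(A, B):
    """C = A @ B for row-major lists of lists (no blocking, no SIMD)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            row_b = B[p]
            row_c = C[i]
            for j in range(m):
                row_c[j] += a * row_b[j]
    return C

if __name__ == "__main__":
    n = 96  # illustrative size; bump it up for a more stable timing
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    t0 = time.time()
    naive_dgemm(A, B)
    dt = time.time() - t0
    print("naive dgemm: %.4f GFLOPS" % (2.0 * n**3 / dt / 1e9))
```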

> 2. The boards are 5"x5" (12.7cm x 12.7cm). I wonder where to find a
> chassis/open-air frame for 16, 32 or 64 nodes if I have to extend
> my build. If you have any suggestions I would be glad to hear them.

woodshop?  I'd probably just tie-wrap the boards to a frame.

> 3. I'm not an electrical engineer but I wonder if there could be a problem
> with powering up 32/64 nodes at once. There is no wattage
> characterization data for this board right now, but I saw some
> information that this board should be sub-10W.

the nvidia page says "in the range of 5W for real workloads", which is 
pretty weaselly.  the board appears to have a PC-style molex connector,
but some specs say it takes just 12VDC (molex also carries 5VDC).  in either
case, it would seem very easy to gang many boards together, especially
since there's no smart PSU control to worry about.
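back-of-envelope sizing for a shared 12VDC rail, taking the vendor's vague 
sub-10W claim as an assumption (measure a real board before committing to a 
supply):

```python
# back-of-envelope PSU sizing for ganging boards on a shared 12VDC rail.
# the per-board wattage is the vendor's vague "sub-10W" claim -- treat it
# as an assumption, and measure a real board under load first.
def psu_amps(n_boards, watts_per_board=10.0, volts=12.0, margin=1.25):
    """Required rail current, with 25% headroom for power-on surge."""
    return n_boards * watts_per_board / volts * margin

for n in (16, 32, 64):
    print("%2d boards: %5.1f A @ 12V" % (n, psu_amps(n)))
```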

> 4. The theoretical max for this platform is 326 SP GFLOPS, and I was able to
> confirm that the DP/SP ratio is 1/24, so the theoretical max for DP is 13 GFLOPS.
> Can someone elaborate or point me to documentation on how hard it will be to
> utilize this power, assuming CUDA and MPI usage.

"utilize"?  it's pretty low flops, so the onboard 2G will be plenty
to keep it busy.  otoh, the memory is only 64b wide (no mention of 
memory clock I've seen), so probably fairly low-bandwidth.
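a quick sanity check on the numbers quoted above.  note the memory clock 
used below is an assumption (it wasn't published at the time) -- plug in the 
real figure once it's known:

```python
# sanity-check the numbers in the thread.
sp_gflops = 326.0          # vendor's peak, single precision
dp_ratio = 1.0 / 24.0      # DP/SP throughput ratio quoted above
dp_gflops = sp_gflops * dp_ratio
print("peak DP: %.1f GFLOPS" % dp_gflops)   # ~13.6

# bandwidth on a 64-bit bus; the effective transfer rate here is a
# hypothetical DDR3-1866 figure, not a published spec.
bus_bits = 64
mt_per_s = 1866e6          # hypothetical transfers/second
gb_per_s = bus_bits / 8 * mt_per_s / 1e9
print("memory bw: %.1f GB/s" % gb_per_s)    # ~14.9 at that clock
```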

> 5. The operating system resides on eMMC; are there any reasons to switch
> to an SD card or an SSD (there is a SATA port on board)?

none.  or even boot over the net.

> I'm open to any suggestions, even if it means changing everything in
> this build :)

IMO, you can learn everything you need to learn from 4-8 low-end PCs.
there are certainly power differences versus an arm+low-end-gpu board
like this, but since this device delivers pretty much token gflops,
you might consider just using a raspberry pi or beaglebone if you 
have your heart set on avoiding the PC market.

regards, mark hahn.

More information about the Beowulf mailing list