[Beowulf] GPU based cluster for machine learning purposes

Thu Apr 10 05:28:12 PDT 2014

On Thu, Apr 10, 2014 at 01:44:30AM -0400, Mark Hahn wrote:
> >I'm considering proof of concept Beowulf cluster build for machine
> >learning purposes.
> 
> you can't go wrong using cheap/PC/commodity parts.  you'll also get the
> easiest access to tools/distros/etc.
> 

I'm concerned about cluster size; I would like to keep it as small as
possible. A Mini/Nano-ITX board would probably be good enough to beat the
Jetson TK1. I wonder what the whole setup would cost and how that compares
with the Jetson.

> >In short I need as good as possible double precision matrix
> >multiplication performance with small power consumption and size.
> 
> TK1 appears to be SP-oriented (not surprisingly).  it's a little unclear
> what its power dissipation is - I'd guess something in the 20W range for
> linpack.
> 
> >Taking matrix multiplication into consideration I thought that GPU is
> >natural choice.
> 
> well, maybe.  you always save power by operating more units at lower clock,
> and GPU tends to embrace this approach.  it's not like GPUs have some
> magically more efficient circuits otherwise.  but it's probably worth
> looking at the gpu-linpack performance/watt from AMD's APU options.  (though
> they contain higher-performance CPU and memory support than TK1.)
> 
Very good point! Following your AMD APU advice I found this article:
http://www.anandtech.com/show/7711/floating-point-peak-performance-of-kaveri-and-other-recent-amd-and-intel-chips
I will rethink my configuration around an AMD APU on a Mini/Nano-ITX
board and see whether I can get a better performance/price ratio.

> >curious about your professional opinion on this build.
> 
> my professional opinion is that when people use the phrase "build"
> as a noun, they're coming from the PC/gamer world ;)
> 
> sorry!
:) More PC than gamer, maybe my English is not good enough.
> 
> >Questions that already came to my mind:
> >1. What are the most used diagnostic software for keeping cluster up and
> >running.
> 
> what failure modes are you thinking about?  I use IPMI on my clusters,
> and wouldn't build a cluster without it.
> 
I mean board power-on failures, bad blocks, overheating and other
hardware issues.

I don't know of any development board with a BMC; AFAIK it is typically a
server component. I agree that remote management ability is very important.
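
Without a BMC, overheating at least can be watched from the OS. Below is a
minimal sketch in Python, assuming a Linux board that exposes the standard
sysfs thermal zones (values in millidegrees Celsius). The function names and
the 80 C limit are my own placeholders, not taken from any board's
documentation.

```python
import glob


def millidegrees_to_celsius(raw):
    """Convert a sysfs thermal reading (millidegrees C, as text) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {sysfs path: temperature in C} for every readable thermal zone."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                readings[path] = millidegrees_to_celsius(f.read())
        except (OSError, ValueError):
            # Some zones refuse reads or return garbage; skip them.
            continue
    return readings


def overheated(readings, limit_c=80.0):
    """Return only the zones at or above limit_c (80 C is an arbitrary assumption)."""
    return {path: temp for path, temp in readings.items() if temp >= limit_c}


# On a machine without sysfs thermal zones this simply prints nothing.
for path, temp in read_thermal_zones().items():
    print(f"{path}: {temp:.1f} C")
```

Run from cron on each node and reported to a head node, this would cover the
overheating case only; power-on failures still need something outside the
board itself.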

> >4. Theoretical max for this platform is 326 SP GFLOPS, I was able to
> >confirm that DP/SP ratio is 1/24 so theoretical max for DP is 13 GFLOPS.
> >Can someone elaborate or point me to documentation how hard will be to
> >utilize this power assuming CUDA and MPI usage.
> 
> "utilize"?  it's pretty low flops, so the onboard 2G will be plenty
> to keep it busy.  otoh, the memory is only 64b wide (no mention of memory
> clock I've seen), so probably fairly low-bandwidth.
> 
The spec lists the memory as DDR3L FBGA96, 256Mbit x 16, 933MHz, Hynix
H5TC4G63AFR-RDA.
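
A rough sketch of the arithmetic behind these numbers, in Python. It assumes
that 933MHz is the DDR3L I/O clock with two transfers per clock, and it uses
the 64-bit bus width Mark mentions; the function names are mine, for
illustration only.

```python
def dp_peak_gflops(sp_peak_gflops, dp_sp_ratio):
    """Theoretical DP peak derived from the SP peak and the DP/SP ratio."""
    return sp_peak_gflops * dp_sp_ratio


def ddr_bandwidth_gbs(io_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical DDR bandwidth in GB/s (assumes double data rate by default)."""
    bytes_per_transfer = bus_width_bits / 8
    return io_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


dp_peak = dp_peak_gflops(326.0, 1.0 / 24.0)   # 326 SP GFLOPS, DP/SP = 1/24
bandwidth = ddr_bandwidth_gbs(933.0, 64)      # 933 MHz clock, 64-bit bus

print(f"DP peak: {dp_peak:.1f} GFLOPS")
print(f"memory bandwidth: {bandwidth:.1f} GB/s")
```

Under those assumptions this comes out to roughly 13.6 DP GFLOPS and about
14.9 GB/s, both theoretical upper bounds rather than measured values.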

> >I'm open to any suggestions, even if it means changing everything in
> >this build :)
> 
> IMO, you can learn everything you need to learn from 4-8 low-end PCs.
> there are certainly power differences versus an arm+low-end-gpu board
> like this, but since this device delivers pretty much token gflops,
> you might consider just using a raspberry pi or beaglebone if you have your
> heart set on avoiding the PC market.

I considered the RPi and BeagleBone. I measured performance on the RPi and
got 68 DP MFLOPS after overclocking. There is untapped performance in the
VideoCore IV GPU (24 SP GFLOPS), but there is no C compiler for it
(only reverse-engineered assembly). BeagleBone MX seems to reach about
50-60 MFLOPS according to this:
http://www.vesperix.com/arm/atlas-arm/bench/gcc-a8/index.html

So these boards are not comparable with the Jetson. I will take a look at
the Mini/Nano-ITX PC market.
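
To put the gap in numbers, here is a small Python sketch. Note that it
compares measured or benchmarked CPU figures against the Jetson's theoretical
DP peak (326 SP GFLOPS at a 1/24 DP/SP ratio), so the ratios are upper bounds.
The 55 MFLOPS value is just my midpoint of the 50-60 range.

```python
# DP performance in MFLOPS; sources are noted per entry.
JETSON_DP_PEAK_MFLOPS = 326.0e3 / 24.0  # theoretical, not measured

boards_dp_mflops = {
    "Raspberry Pi (overclocked, measured)": 68.0,
    "BeagleBone (ATLAS benchmark, assumed midpoint)": 55.0,
}

# How many times faster the Jetson's theoretical DP peak is than each board.
ratios = {
    name: JETSON_DP_PEAK_MFLOPS / mflops
    for name, mflops in boards_dp_mflops.items()
}

for name, ratio in ratios.items():
    print(f"{name}: Jetson DP peak is about {ratio:.0f}x higher")
```

Even if the Jetson only reaches a fraction of its peak, the difference is
around two orders of magnitude.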

I appreciate your reply Mark, thanks.

Regards,
Piotr Król