[Beowulf] HPC workflows

Sun Dec 9 07:26:05 PST 2018

On Fri, 7 Dec 2018 16:19:30 +0100, you wrote:

>Perhaps for another thread:
>Actually I went to the AWS User Group in the UK on Wednesday. Very
>impressive, and there are the new Lustre filesystems and MPI networking.
>I guess the HPC World will see the same philosophy of building your setup
>using the AWS toolkit as Uber etc. etc. do today.
>Also a lot of noise is being made at the moment about the convergence of
>HPC and Machine Learning workloads.
>Are we going to see the Machine Learning folks adapting their workflows to
>run on HPC on-premise bare metal clusters?
>Or are we going to see them go off and use AWS (Azure, Google ?)

I suspect that ML will not go on-premise, for a number of reasons.

First, ignoring cost, companies like Google, Amazon and Microsoft are
very good at ML because not only are they driving the research but
they need it for their business.  So they have the in-house expertise
not only to implement cloud systems that are ideal for ML, but to
implement custom hardware (see Google's Tensor Processing Unit).

Second, setting up a new cluster isn't going to be easy.  Finding
physical space, making sure enough utilities can be supplied to
support the hardware, staffing up, etc. are not only difficult but
inherently take time, when instead you can simply sign up with a
cloud provider and have the project running within 24 hours.
Would HPC exist today as we know it if the ability to instantly turn
on a cluster existed at the beginning?

Third, and this is very speculative, I suspect ML is heading towards
custom hardware.  It has had a very good run using GPUs, and a GPU
will likely always be the entry point for desktop ML, but unless
Nvidia is holding back due to a lack of competition, it does appear
the GPU is reaching the end of its development, much as CPUs have.
The latest hardware from Nvidia is getting lacklustre reviews, and
the bolting on of additional features like raytracing is perhaps an
indication that there are limits to how much further the GPU
architecture can be pushed.  The question then is whether the ML
market is big enough to support that custom hardware as an OEM
product like a GPU, or whether it will remain restricted to places
like Google, which can afford to build it without the overheads a
consumer product entails.