[Beowulf] HPC workflows

Sun Dec 9 13:45:29 PST 2018

> On Dec 9, 2018, at 7:26 AM, Gerald Henriksen <ghenriks at gmail.com> wrote:
> 
>> On Fri, 7 Dec 2018 16:19:30 +0100, you wrote:
>> 
>> Perhaps for another thread:
>> Actually I went to the AWS User Group in the UK on Wednesday. Very
>> impressive, and there are the new Lustre filesystems and MPI networking.
>> I guess the HPC World will see the same philosophy of building your setup
>> using the AWS toolkit as Uber etc. etc. do today.
>> Also a lot of noise is being made at the moment about the convergence of
>> HPC and Machine Learning workloads.
>> Are we going to see the Machine Learning folks adapting their workflows to
>> run on HPC on-premise bare metal clusters?
>> Or are we going to see them go off and use AWS (Azure, Google ?)
> 
> I suspect that ML will not go for on-premise for a number of reasons.
> 
> First, ignoring cost, companies like Google, Amazon and Microsoft are
> very good at ML because not only are they driving the research but
> they need it for their business.  So they have the in house expertise
> not only to implement cloud systems that are ideal for ML, but to
> implement custom hardware - see Google's Tensor Processing Unit.
> 
> Second, setting up a new cluster isn't going to be easy.  Finding
> physical space, making sure enough utilities can be supplied to
> support the hardware, staffing up, etc.  are not only going to be
> difficult but inherently take time when instead you can simply sign
> up to a cloud provider and have the project running within 24 hours.
> Would HPC exist today as we know it if the ability to instantly turn
> on a cluster existed at the beginning?
> 
> Third, albeit this is very speculative, I suspect ML is
> heading towards using custom hardware.  It has had a very good run
> using GPUs, and a GPU will likely always be the entry point for
> desktop ML, but unless Nvidia is holding back due to a lack of
> competition it does appear the GPU is reaching an end to its
> development much like CPUs have.  The latest hardware from Nvidia is
> getting lacklustre reviews, and the bolting on of additional things
> like raytracing is perhaps an indication that there are limits to how
> much further the GPU architecture can be pushed.  The question then is
> whether the ML market is big enough to have that custom hardware as an
> OEM product like a GPU, or whether it will remain restricted to places
> like Google who can afford to build it without the necessary overheads
> of a consumer product.

My data points are the opposite. 

1. As it progresses from experiment to real use, most AI/ML/DL work takes place near where the data is. Since for many organizations that data is on-premises, the work stays on-premises. For cloud services, it stays in the cloud.

2. The investment isn’t huge and is incremental, so there isn’t a strong barrier to buying the kit. Models never get ‘finished’ and require regular retesting on historical and new data, so they can keep the hardware busy. GPUs are plenty good enough because most of the frameworks parallelize (scale out) easily. There is also a desire to test models on other, similar data, but that data takes prep and a common data source. The cost of dedicated storage at this scale is not prohibitive, but moving data to and from the cloud can be. Most projects start very small to prove effectiveness. It isn’t a big tender to get started - unless you are doing Autonomous Driving...

3. There will be specialized solutions for inference, but that isn’t the same as training. IMHO, the specialized silicon or designs will be driven by using the AI near the edge within the constraints of power, footprint, etc. Training will still be scale-out & centralized. GPUs will still work for a long time, just like CPUs did.