[Beowulf] HPC in the cloud question
dag at sonsorol.org
Fri May 8 03:54:05 PDT 2015
If you are on AWS, start your eval with MIT StarCluster, which is an
amazing open source suite of Python code that builds elastic HPC
clusters on AWS with MPI, a shared filesystem, and all the stuff your
users would be familiar with. It defaults to Grid Engine as the
scheduler (super convenient for me), but there are plugins for other
schedulers. The OS, I think, is Amazon Linux, which would feel like
CentOS to an end user. The developer and end-user community is
fantastic and super helpful; a few of them are on this list.
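To give a flavor of how little there is to it, a minimal StarCluster setup looks roughly like this. This is a from-memory sketch, not a tested config -- the key paths, instance type, and cluster name are placeholders, so check the StarCluster docs for the current option names:

```ini
; ~/.starcluster/config -- illustrative sketch; values are placeholders
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_INSTANCE_TYPE = c3.xlarge
; NODE_IMAGE_ID can usually be omitted; StarCluster supplies its own AMIs
```

From there it's something like "starcluster start -c smallcluster mycluster" to bring the cluster up, "starcluster sshmaster mycluster" to land on the head node with Grid Engine and NFS already wired up, and "starcluster terminate mycluster" to tear it all down when you're done paying for it.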
Another solid AWS-only product is cfncluster
(http://cfncluster.readthedocs.org/en/latest/) -- written by AWS but
also open source. It has a higher learning curve than MIT StarCluster,
but the advantage of being built on top of the AWS CloudFormation
stack, which means it is fantastic at standing up, terminating, and
elastically growing a complex stack of AWS building blocks in a robust
way.
I've got various other AWS-specific thoughts and impressions on VPCs,
placement groups for HPC nodes, instance-type selection, etc., but I
have to run to work. It would be interesting to get an HPC-on-IaaS
discussion going here, even though it may upset some traditionalists!
The other IaaS platforms are poorer choices for flexible, open-ended
HPC: there are far fewer building blocks to choose from when building
flexible, general-purpose HPC-type systems. At the lower end of the
niche you just have dead-end companies that slap a cloud label on
rebranded VMware and sell it to you at a high price. Both Google
Compute and Microsoft Azure can be good, but the most value there comes
from purpose-built systems and workflows. We've done some stuff on
Google Compute recently that was a lot of fun, but it was for a very
specific and tightly scoped project. For the flexible,
research-oriented stuff there is far more choice, freedom, and
flexibility on AWS, simply because they have an order of magnitude more
"building blocks" than the competition.
> Hutcheson, Mike <Mike_Hutcheson at baylor.edu>
> May 7, 2015 at 6:28 PM
> Hi. We are working on refreshing the centralized HPC cluster resources
> that our university researchers use. I have been asked by our
> administration to look into HPC in the cloud offerings as a possibility to
> purchasing or running a cluster on-site.
> We currently run a 173-node, CentOS-based cluster with ~120TB (soon to
> increase to 300+TB) in our datacenter. It's a standard cluster
> configuration: IB network, distributed file system (BeeGFS. I really
> like it), Torque/Maui batch. Our users run a varied workload, from
> fine-grained, MPI-based parallel apps scaling to 100s of cores to
> coarse-grained, high-throughput jobs (we're a CMS Tier-3 site) with high
> I/O requirements.
> Whatever we transition to, whether it be a new in-house cluster or
> something "out there", I want to minimize the amount of change or learning
> curve our users would have to experience. They should be able to focus on
> their research and not have to spend a lot of their time learning a new
> system or trying to spin one up each time they have a job to run.
> If you have worked with HPC in the cloud, either as an admin and/or
> someone who has used cloud resources for research computing purposes, I
> would appreciate learning your experience.
> Even if you haven't used the cloud for HPC computing, please feel free to
> share your thoughts or concerns on the matter.
> Sort of along those same lines, what are your thoughts about leasing a
> cluster and running it on-site?
> Thanks for your time,
> Mike Hutcheson
> Assistant Director of Academic and Research Computing Services
> Baylor University
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing