[Beowulf] Using commercial clouds for HPC

Thu May 7 13:24:55 PDT 2009

stephen mulcahy wrote:
> Hi,
> 
> I'm pretty sure this came up in some shape or form at some stage on this
> list but after extensive Googling and Swishing(!?) of the list I can't
> find anything concrete so apologies if I'm restarting an old thread.
> 
> Has anyone done any investigation into using commercial clouds such as
> Amazon's EC2 for HPC type loads?

I built a small 16-node cluster on EC2, and it worked well.  I configured the
cluster to use MPI, a shared file system via NFS, and a batch queue.  This was
before their locality API was available, so I had to do some snooping about to
figure out how close my nodes were.  It ended up performing much like a normal
GigE-connected cluster, suitable for any job that wasn't too interconnect
intensive.  Virtualization overhead for the CPU and memory system was rather
small; it seemed to be under 20%, which isn't too big a deal since you can
always just allocate more nodes.
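
For anyone wanting to do the same kind of snooping, here is a minimal sketch
of one way to guess how close two nodes are: time repeated TCP connects from
one node to another and compare the medians across node pairs.  This is not
necessarily what I did at the time; the function names, the sample count and
the choice of TCP connect time as a proxy for network distance are all
assumptions made for illustration.

```python
import socket
import statistics
import time


def tcp_connect_times(host, port, samples=20, timeout=2.0):
    """Return a list of TCP connect times (seconds) to host:port.

    Connect time is only a rough proxy for network distance (assumption):
    it covers one round trip plus kernel overhead on both ends.
    """
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the TCP handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        times.append(elapsed)
    return times


def median_connect_ms(host, port, samples=20):
    """Median connect time in milliseconds; the median damps outliers."""
    return statistics.median(tcp_connect_times(host, port, samples)) * 1000.0
```

Run it from one node against a port that is open on each of the others (sshd
on port 22, say).  Pairs with noticeably higher medians are probably further
apart in the network, though a shared, virtualized network will add noise.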

Amazon is also making quite a few datasets available: human genome, US Census,
UniGene, swine flu sequence, and many others.  Certainly, if you need access to
those datasets, that would make EC2 that much more attractive.

So if you are willing to pay $0.10 to $0.80 per hour for a node, it seems
feasible to me.
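
To put rough numbers on that, here is a small back-of-the-envelope sketch.
It assumes a 20% overhead means each virtual node delivers 80% of a bare-metal
node's throughput, and that the job scales perfectly when nodes are added;
neither is guaranteed.  The 24-hour run length and the function names are
hypothetical, and prices are kept in whole cents to avoid floating-point
rounding.

```python
def nodes_to_match(bare_metal_nodes, overhead_percent):
    """Virtual nodes needed to match bare-metal throughput, rounded up.

    Assumes perfect scaling, which real MPI jobs rarely achieve.
    """
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in [0, 100)")
    remaining = 100 - overhead_percent
    # Ceiling division using integers only.
    return -(-bare_metal_nodes * 100 // remaining)


def cluster_cost_cents(nodes, hours, cents_per_node_hour):
    """Total cost in cents for running `nodes` nodes for `hours` hours."""
    return nodes * hours * cents_per_node_hour


def format_dollars(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


if __name__ == "__main__":
    nodes = nodes_to_match(16, 20)   # 16 bare-metal nodes, 20% overhead
    for price in (10, 80):           # $0.10 and $0.80 per node-hour
        total = cluster_cost_cents(nodes, 24, price)
        print(nodes, "nodes for 24 hours:", format_dollars(total))
```

This leaves out storage and data transfer charges, so treat the result as a
lower bound on the bill.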

> I'm still not entirely clear on whether they are a suitable alternative
> from a software/hardware perspective or whether they have lots of hidden
> bottlenecks due to their virtual nature that HPC codes would run badly
> and/or not at all (obviously depending on the specific code).

Hard to imagine things not running; Linux is Linux for the most part.  My CPU
benchmarks seemed reasonable.  I'd suggest that if you have a dollar or two,
open an account and play with a node for an hour; you should quickly have an
idea of what performance to expect.  I believe their location API now allows
you to allocate nodes with high physical locality.
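
If you just want a first sanity check before running your real code, a tiny
timing harness like the one below can be run on both a local machine and a
rented node.  It is only a stand-in: the kernel (a partial sum of 1/i^2) and
the parameter defaults are arbitrary choices for illustration, not the
benchmarks I used, and an interpreted loop says little about memory bandwidth
or interconnect.  Your own application is always the better benchmark.

```python
import time


def time_kernel(n=200_000, repeats=3):
    """Return (best_seconds, total) for a small floating-point loop.

    Taking the best of several runs reduces noise from other activity
    on the machine, which matters on a shared, virtualized host.
    """
    best = float("inf")
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total = 0.0
        for i in range(1, n + 1):
            total += 1.0 / (i * i)   # converges towards pi^2 / 6
        best = min(best, time.perf_counter() - start)
    return best, total


if __name__ == "__main__":
    seconds, total = time_kernel()
    print("best time (s):", seconds, "checksum:", total)
```

Comparing the best time on the two machines gives a crude single-core ratio;
the checksum is there only to confirm both runs did the same work.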

I had planned to see how much their switch uplinks were oversubscribed, but it
wasn't particularly feasible without the location API.

> Also, if anyone is using these for HPC type loads, how does the pricing
> work out relative to owning your own similar sized cluster (if anyone
> has done those numbers)? I notice that EC2 has different pricing for
> "high cpu" - but I wonder is this typical business app "high cpu"
> (little 25% spikes every few minutes) or typical hpc app "high cpu"
> (long 90% spikes all the time :)

Seems like they just mean more CPU per unit of memory.