[Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud

Mon Oct 3 11:50:22 PDT 2011

There's a free, open source application called StarCluster that can do
most (if not all) of the EC2 provisioning and cluster setup for a High
Throughput Computing cluster:

http://web.mit.edu/stardev/cluster/

StarCluster sets up NFS, SGE, a BLAS library, Open MPI, etc.
automatically for the user in around 10-15 minutes. StarCluster is
licensed under the LGPL, written in Python with Boto, and supports many
of the new EC2 features (Cluster Compute Instances, Spot Instances,
Cluster GPU Instances, etc.). Support for launching clusters with a
higher node count (100+ instances) is even better with the new
scalability enhancements in the latest version (0.92).

And there are some tutorials on YouTube:

- "StarCluster 0.91 Demo":
http://www.youtube.com/watch?v=vC3lJcPq1FY

- "Launching a Cluster on Amazon Ec2 Spot Instances Using StarCluster":
http://www.youtube.com/watch?v=2Ym7epCYnSk

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

On Wed, Sep 21, 2011 at 7:02 AM, Eugen Leitl <eugen at leitl.org> wrote:
>
> http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
>
> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
>
> By Jon Brodkin | Published September 20, 2011 10:49 AM
>
> Amazon EC2 and other cloud services are expanding the market for
> high-performance computing. Without access to a national lab or a
> supercomputer in your own data center, cloud computing lets businesses spin
> up temporary clusters at will and stop paying for them as soon as the
> computing needs are met.
>
> A vendor called Cycle Computing is on a mission to demonstrate the potential
> of Amazon’s cloud by building increasingly large clusters on the Elastic
> Compute Cloud. Even with Amazon, building a cluster takes some work, but
> Cycle combines several technologies to ease the process and recently used
> them to create a 30,000-core cluster running CentOS Linux.
>
> The cluster, announced publicly this week, was created for an unnamed “Top 5
> Pharma” customer, and ran for about seven hours at the end of July at a peak
> cost of $1,279 per hour, including the fees to Amazon and Cycle Computing.
> The details are impressive: 3,809 compute instances, each with eight cores
> and 7GB of RAM, for a total of 30,472 cores, 26.7TB of RAM and 2PB
> (petabytes) of disk space. Security was ensured with HTTPS, SSH and 256-bit
> AES encryption, and the cluster ran across data centers in three Amazon
> regions in the United States and Europe. The cluster was dubbed “Nekomata.”
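
As a quick sanity check on the figures quoted above, here is a small
Python sketch that recomputes the totals from the per-instance numbers.
The function names are mine, and the per-core-hour cost is only a rough
figure derived from the peak hourly rate, assuming 1 TB = 1000 GB and
that all cores were up at peak.

```python
# Figures taken from the quoted article.
INSTANCES = 3809
CORES_PER_INSTANCE = 8
RAM_GB_PER_INSTANCE = 7
PEAK_COST_PER_HOUR = 1279.0  # USD, Amazon plus Cycle Computing fees


def cluster_totals(instances, cores_each, ram_gb_each):
    """Return (total cores, total RAM in TB), assuming 1 TB = 1000 GB."""
    total_cores = instances * cores_each
    total_ram_tb = instances * ram_gb_each / 1000.0
    return total_cores, total_ram_tb


def cost_per_core_hour(cost_per_hour, cores):
    """Rough peak cost per core-hour (assumes every core is running)."""
    return cost_per_hour / cores


if __name__ == "__main__":
    cores, ram_tb = cluster_totals(
        INSTANCES, CORES_PER_INSTANCE, RAM_GB_PER_INSTANCE
    )
    print("total cores:", cores)
    print("total RAM (TB): %.1f" % ram_tb)
    print("peak USD per core-hour: %.4f"
          % cost_per_core_hour(PEAK_COST_PER_HOUR, cores))
```

The totals agree with the article (30,472 cores and about 26.7TB of
RAM), and the peak rate works out to roughly four cents per core-hour.
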
>
> Spreading the cluster across multiple continents was done partly for disaster
> recovery purposes, and also to guarantee that 30,000 cores could be
> provisioned. “We thought it would improve our probability of success if we
> spread it out,” Cycle Computing’s Dave Powers, manager of product
> engineering, told Ars. “Nobody really knows how many instances you can get at
> any one time from any one [Amazon] region.”
>
> Amazon offers its own special cluster compute instances, at a higher cost
> than regular-sized virtual machines. These cluster instances provide 10
> Gigabit Ethernet networking along with greater CPU and memory, but they
> weren’t necessary to build the Cycle Computing cluster.
>
> The pharmaceutical company’s job, related to molecular modeling, was
> “embarrassingly parallel” so a fast interconnect wasn’t crucial. To further
> reduce costs, Cycle took advantage of Amazon’s low-price “spot instances.” To
> manage the cluster, Cycle Computing used its own management software as well
> as the Condor High-Throughput Computing software and Chef, an open source
> systems integration framework.
>
> Cycle demonstrated the power of the Amazon cloud earlier this year with a
> 10,000-core cluster built for a smaller pharma firm called Genentech. Now,
> 10,000 cores is a relatively easy task, says Powers. “We think we’ve mastered
> the small-scale environments,” he said. 30,000 cores isn’t the end game,
> either. Going forward, Cycle plans bigger, more complicated clusters, perhaps
> ones that will require Amazon’s special cluster compute instances.
>
> The 30,000-core cluster may or may not be the biggest one run on EC2. Amazon
> isn’t saying.
>
> “I can’t share specific customer details, but can tell you that we do have
> businesses of all sizes running large-scale, high-performance computing
> workloads on AWS [Amazon Web Services], including distributed clusters like
> the Cycle Computing 30,000 core cluster to tightly-coupled clusters often
> used for science and engineering applications such as computational fluid
> dynamics and molecular dynamics simulation,” an Amazon spokesperson told Ars.
>
> Amazon itself actually built a supercomputer on its own cloud that made it
> onto the list of the world’s Top 500 supercomputers. With 7,000 cores, the
> Amazon cluster ranked number 232 in the world last November with speeds of
> 41.82 teraflops, falling to number 451 in June of this year. So far, Cycle
> Computing hasn’t run the Linpack benchmark to determine the speed of its
> clusters relative to Top 500 sites.
>
> But Cycle’s work is impressive no matter how you measure it. The job
> performed for the unnamed pharma company “would take well over a week for
> them to run internally,” Powers says. In the end, the cluster performed the
> equivalent of 10.9 “compute years of work.”
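
To put the 10.9 "compute years" in perspective, the sketch below
converts it to core-hours. The article does not define the term, so
this assumes one compute year means a single core kept busy for 365
days; the function names are mine.

```python
# Assumption: one "compute year" = one core busy for 365 days.
HOURS_PER_YEAR = 365 * 24

COMPUTE_YEARS = 10.9   # from the article
TOTAL_CORES = 30472    # from the article


def core_hours(compute_years):
    """Convert compute years to core-hours under the assumption above."""
    return compute_years * HOURS_PER_YEAR


def hours_at_full_utilization(compute_years, cores):
    """Wall-clock hours needed if every core were busy the whole time."""
    return core_hours(compute_years) / cores


if __name__ == "__main__":
    print("core-hours: %.0f" % core_hours(COMPUTE_YEARS))
    print("hours at full utilization: %.2f"
          % hours_at_full_utilization(COMPUTE_YEARS, TOTAL_CORES))
```

Under that assumption the job is about 95,484 core-hours, or a bit over
three hours with all 30,472 cores busy. The article says the cluster ran
for about seven hours, which would be consistent with it not sitting at
peak size for the whole run.
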
>
> The task of managing such large cloud-based clusters forced Cycle to step up
> its own game, with a new plug-in for Chef the company calls Grill.
>
> “There is no way that any mere human could keep track of all of the moving
> parts on a cluster of this scale,” Cycle wrote in a blog post. “At Cycle,
> we’ve always been fans of extreme IT automation, but we needed to take this
> to the next level in order to monitor and manage every instance, volume,
> daemon, job, and so on in order for Nekomata to be an efficient 30,000 core
> tool instead of a big shiny on-demand paperweight.”
>
> But problems did arise during the 30,000-core run.
>
> “You can be sure that when you run at massive scale, you are bound to run
> into some unexpected gotchas,” Cycle notes. “In our case, one of the gotchas
> included such things as running out of file descriptors on the license
> server. In hindsight, we should have anticipated this would be an issue, but
> we didn’t find that in our prelaunch testing, because we didn’t test at full
> scale. We were able to quickly recover from this bump and keep moving along
> with the workload with minimal impact. The license server was able to keep up
> very nicely with this workload once we increased the number of file
> descriptors.”
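
The article doesn't say how Cycle raised the limit on the license
server. For anyone wanting to check this on their own daemons, here is
a minimal Python sketch using the standard `resource` module (Unix
only). The helper names are mine, and it only touches the limit of the
calling process; a real server would more likely be fixed via `ulimit
-n` or the system limits configuration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Raise the soft limit toward `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed hard
    if target <= soft:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
```

Calling `raise_soft_nofile(4096)` early in a process's startup is one
way to get more headroom without root, as long as the hard limit
permits it.
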
>
> Cycle also hit a speed bump related to volume and byte limits on Amazon’s
> Elastic Block Store volumes. But the company is already planning bigger and
> better things.
>
> “We already have our next use-case identified and will be turning up the
> scale a bit more with the next run,” the company says. But ultimately, “it’s
> not about core counts or terabytes of RAM or petabytes of data. Rather, it’s
> about how we are helping to transform how science is done.”
>
