[Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud

Eugen Leitl eugen at leitl.org
Wed Sep 21 04:02:39 PDT 2011


http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars

$1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud

By Jon Brodkin | Published September 20, 2011 10:49 AM

Amazon EC2 and other cloud services are expanding the market for
high-performance computing. For businesses without access to a national lab
or a supercomputer in their own data center, cloud computing makes it
possible to spin up temporary clusters at will and stop paying for them as
soon as the computing need is met.

A vendor called Cycle Computing is on a mission to demonstrate the potential
of Amazon’s cloud by building increasingly large clusters on the Elastic
Compute Cloud. Even with Amazon, building a cluster takes some work, but
Cycle combines several technologies to ease the process and recently used
them to create a 30,000-core cluster running CentOS Linux.

The cluster, announced publicly this week, was created for an unnamed “Top 5
Pharma” customer, and ran for about seven hours at the end of July at a peak
cost of $1,279 per hour, including the fees to Amazon and Cycle Computing.
The details are impressive: 3,809 compute instances, each with eight cores
and 7GB of RAM, for a total of 30,472 cores, 26.7TB of RAM and 2PB
(petabytes) of disk space. Security was ensured with HTTPS, SSH and 256-bit
AES encryption, and the cluster ran across data centers in three Amazon
regions in the United States and Europe. The cluster was dubbed “Nekomata.”
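As a quick sanity check, the per-instance figures are consistent with the
quoted totals; in Python:

    # Back-of-the-envelope check of the cluster totals quoted above.
    instances = 3809
    cores_per_instance = 8
    ram_gb_per_instance = 7

    total_cores = instances * cores_per_instance            # 30,472 cores
    total_ram_tb = instances * ram_gb_per_instance / 1000   # ~26.7 TB (decimal units)

    print(f"cores: {total_cores:,}")         # cores: 30,472
    print(f"RAM:   {total_ram_tb:.1f} TB")   # RAM:   26.7 TB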

The cluster was spread across multiple continents partly for disaster
recovery, and partly to guarantee that 30,000 cores could actually be
provisioned. “We thought it would improve our probability of success if we
spread it out,” Cycle Computing’s Dave Powers, manager of product
engineering, told Ars. “Nobody really knows how many instances you can get at
any one time from any one [Amazon] region.”

Amazon offers its own special cluster compute instances at a higher cost
than its standard virtual machines. These cluster instances provide 10
Gigabit Ethernet networking along with more CPU and memory, but they
weren’t necessary to build the Cycle Computing cluster.

The pharmaceutical company’s job, related to molecular modeling, was
“embarrassingly parallel” so a fast interconnect wasn’t crucial. To further
reduce costs, Cycle took advantage of Amazon’s low-price “spot instances.” To
manage the cluster, Cycle Computing used its own management software as well
as the Condor High-Throughput Computing software and Chef, an open source
systems integration framework.
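
The article doesn’t show how the spot capacity was requested, and Cycle used
its own tooling on top of Condor and Chef. Purely as an illustration, here is
a minimal sketch of a spot request using the boto3 Python SDK; the bid price,
AMI ID, key name, and instance type (c1.xlarge, an 8-core, 7GB type of that
era, which the article does not actually name) are placeholders:

    # Illustrative only: request a batch of EC2 spot instances with boto3.
    # The AMI, instance type, bid price, and region below are placeholders,
    # not details of Cycle Computing's actual run.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.request_spot_instances(
        SpotPrice="0.20",               # maximum hourly bid per instance (USD)
        InstanceCount=100,              # one slice of a much larger cluster
        Type="one-time",                # release capacity when the job finishes
        LaunchSpecification={
            "ImageId": "ami-00000000",  # placeholder CentOS image
            "InstanceType": "c1.xlarge",
            "KeyName": "cluster-key",
        },
    )

    for req in response["SpotInstanceRequests"]:
        print(req["SpotInstanceRequestId"], req["State"])

One-time spot requests like this give up the capacity when the job ends,
which fits the pay-only-while-computing model described above.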

Cycle demonstrated the power of the Amazon cloud earlier this year with a
10,000-core cluster built for a smaller pharma firm called Genentech. Now,
10,000 cores is a relatively easy task, says Powers. “We think we’ve mastered
the small-scale environments,” he says. 30,000 cores isn’t the end game,
either. Going forward, Cycle plans bigger, more complicated clusters, perhaps
ones that will require Amazon’s special cluster compute instances.

The 30,000-core cluster may or may not be the biggest one run on EC2. Amazon
isn’t saying.

“I can’t share specific customer details, but can tell you that we do have
businesses of all sizes running large-scale, high-performance computing
workloads on AWS [Amazon Web Services], including distributed clusters like
the Cycle Computing 30,000 core cluster to tightly-coupled clusters often
used for science and engineering applications such as computational fluid
dynamics and molecular dynamics simulation,” an Amazon spokesperson told Ars.

Amazon itself actually built a supercomputer on its own cloud that made it
onto the list of the world’s Top 500 supercomputers. With 7,000 cores, the
Amazon cluster ranked number 232 in the world last November with speeds of
41.82 teraflops, falling to number 451 in June of this year. So far, Cycle
Computing hasn’t run the Linpack benchmark to determine the speed of its
clusters relative to Top 500 sites.

But Cycle’s work is impressive no matter how you measure it. The job
performed for the unnamed pharma company “would take well over a week for
them to run internally,” Powers says. In the end, the cluster performed the
equivalent of 10.9 “compute years of work.”

The task of managing such large cloud-based clusters forced Cycle to step up
its own game, with a new plug-in for Chef the company calls Grill.

“There is no way that any mere human could keep track of all of the moving
parts on a cluster of this scale,” Cycle wrote in a blog post. “At Cycle,
we’ve always been fans of extreme IT automation, but we needed to take this
to the next level in order to monitor and manage every instance, volume,
daemon, job, and so on in order for Nekomata to be an efficient 30,000 core
tool instead of a big shiny on-demand paperweight.”
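
Grill itself isn’t shown in the article, so the following is only a generic
sketch of the inventory problem being described: tallying instance states
across the three regions with the boto3 Python SDK. The region list and the
tag filter are assumptions, not details of Cycle’s actual setup.

    # Generic illustration of the fleet-inventory problem described above:
    # tally instance states across several regions. The region list and the
    # tag filter are assumptions, not details of Cycle's Grill plug-in.
    from collections import Counter
    import boto3

    REGIONS = ["us-east-1", "us-west-1", "eu-west-1"]   # assumed regions
    states = Counter()

    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        paginator = ec2.get_paginator("describe_instances")
        for page in paginator.paginate(
            Filters=[{"Name": "tag:cluster", "Values": ["nekomata"]}]
        ):
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    states[instance["State"]["Name"]] += 1

    print(dict(states))   # counts by state: running / pending / terminated ...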

But problems did arise during the 30,000-core run.

“You can be sure that when you run at massive scale, you are bound to run
into some unexpected gotchas,” Cycle notes. “In our case, one of the gotchas
included such things as running out of file descriptors on the license
server. In hindsight, we should have anticipated this would be an issue, but
we didn’t find that in our prelaunch testing, because we didn’t test at full
scale. We were able to quickly recover from this bump and keep moving along
with the workload with minimal impact. The license server was able to keep up
very nicely with this workload once we increased the number of file
descriptors.”
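
The fix they describe, raising the file-descriptor limit, is usually applied
with ulimit or /etc/security/limits.conf on the license server; a process can
also raise its own soft limit up to the hard limit at startup, for example in
Python:

    # Raise this process's soft file-descriptor limit up to the hard limit,
    # the same class of fix described above for the license server. Raising
    # the hard limit itself still requires root (ulimit / limits.conf).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"before: soft={soft}, hard={hard}")

    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"after:  soft={soft}, hard={hard}")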

Cycle also hit a speed bump related to volume and byte limits on Amazon’s
Elastic Block Store volumes. But the company is already planning bigger and
better things.

“We already have our next use-case identified and will be turning up the
scale a bit more with the next run,” the company says. But ultimately, “it’s
not about core counts or terabytes of RAM or petabytes of data. Rather, it’s
about how we are helping to transform how science is done.”
