[Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud

Mon Oct 3 10:51:06 PDT 2011

Doug,

Thanks for posting that video. It confirmed what I always suspected
about clouds for HPC.

Prentice

On 10/03/2011 08:25 AM, Douglas Eadline wrote:
> Interesting and pragmatic HPC cloud presentation, worth watching
> (25 minutes)
> 
>  http://insidehpc.com/2011/09/30/video-the-real-future-of-cloud-computing/
> 
> --
> Doug
> 
>>
>> http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
>>
>> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
>>
>> By Jon Brodkin | Published September 20, 2011 10:49 AM
>>
>> Amazon EC2 and other cloud services are expanding the market for
>> high-performance computing. Without access to a national lab or a
>> supercomputer in your own data center, cloud computing lets businesses
>> spin up temporary clusters at will and stop paying for them as soon as
>> the computing needs are met.
>>
>> A vendor called Cycle Computing is on a mission to demonstrate the
>> potential of Amazon’s cloud by building increasingly large clusters on
>> the Elastic Compute Cloud. Even with Amazon, building a cluster takes
>> some work, but Cycle combines several technologies to ease the process
>> and recently used them to create a 30,000-core cluster running CentOS
>> Linux.
>>
>> The cluster, announced publicly this week, was created for an unnamed
>> “Top 5 Pharma” customer, and ran for about seven hours at the end of
>> July at a peak cost of $1,279 per hour, including the fees to Amazon and
>> Cycle Computing. The details are impressive: 3,809 compute instances,
>> each with eight cores and 7GB of RAM, for a total of 30,472 cores,
>> 26.7TB of RAM and 2PB (petabytes) of disk space. Security was ensured
>> with HTTPS, SSH and 256-bit AES encryption, and the cluster ran across
>> data centers in three Amazon regions in the United States and Europe.
>> The cluster was dubbed “Nekomata.”
>>
>> Spreading the cluster across multiple continents was done partly for
>> disaster recovery purposes, and also to guarantee that 30,000 cores
>> could be provisioned. “We thought it would improve our probability of
>> success if we spread it out,” Cycle Computing’s Dave Powers, manager of
>> product engineering, told Ars. “Nobody really knows how many instances
>> you can get at any one time from any one [Amazon] region.”
>>
>> Amazon offers its own special cluster compute instances, at a higher cost
>> than regular-sized virtual machines. These cluster instances provide 10
>> Gigabit Ethernet networking along with greater CPU and memory, but they
>> weren’t necessary to build the Cycle Computing cluster.
>>
>> The pharmaceutical company’s job, related to molecular modeling, was
>> “embarrassingly parallel” so a fast interconnect wasn’t crucial. To
>> further reduce costs, Cycle took advantage of Amazon’s low-price “spot
>> instances.” To manage the cluster, Cycle Computing used its own
>> management software as well as the Condor High-Throughput Computing
>> software and Chef, an open source systems integration framework.
>>
>> Cycle demonstrated the power of the Amazon cloud earlier this year with
>> a 10,000-core cluster built for a smaller pharma firm called Genentech.
>> Now, 10,000 cores is a relatively easy task, says Powers. “We think
>> we’ve mastered the small-scale environments,” he said. 30,000 cores
>> isn’t the end game, either. Going forward, Cycle plans bigger, more
>> complicated clusters, perhaps ones that will require Amazon’s special
>> cluster compute instances.
>>
>> The 30,000-core cluster may or may not be the biggest one run on EC2.
>> Amazon isn’t saying.
>>
>> “I can’t share specific customer details, but can tell you that we do
>> have businesses of all sizes running large-scale, high-performance
>> computing workloads on AWS [Amazon Web Services], including distributed
>> clusters like the Cycle Computing 30,000 core cluster to tightly-coupled
>> clusters often used for science and engineering applications such as
>> computational fluid dynamics and molecular dynamics simulation,” an
>> Amazon spokesperson told Ars.
>>
>> Amazon itself actually built a supercomputer on its own cloud that made
>> it onto the list of the world’s Top 500 supercomputers. With 7,000
>> cores, the Amazon cluster ranked number 232 in the world last November
>> with speeds of 41.82 teraflops, falling to number 451 in June of this
>> year. So far, Cycle Computing hasn’t run the Linpack benchmark to
>> determine the speed of its clusters relative to Top 500 sites.
>>
>> But Cycle’s work is impressive no matter how you measure it. The job
>> performed for the unnamed pharma company “would take well over a week
>> for them to run internally,” Powers says. In the end, the cluster
>> performed the equivalent of 10.9 “compute years of work.”
>>
>> The task of managing such large cloud-based clusters forced Cycle to
>> step up its own game, with a new plug-in for Chef the company calls
>> Grill.
>>
>> “There is no way that any mere human could keep track of all of the
>> moving parts on a cluster of this scale,” Cycle wrote in a blog post.
>> “At Cycle, we’ve always been fans of extreme IT automation, but we
>> needed to take this to the next level in order to monitor and manage
>> every instance, volume, daemon, job, and so on in order for Nekomata to
>> be an efficient 30,000 core tool instead of a big shiny on-demand
>> paperweight.”
>>
>> But problems did arise during the 30,000-core run.
>>
>> “You can be sure that when you run at massive scale, you are bound to
>> run into some unexpected gotchas,” Cycle notes. “In our case, one of the
>> gotchas included such things as running out of file descriptors on the
>> license server. In hindsight, we should have anticipated this would be
>> an issue, but we didn’t find that in our prelaunch testing, because we
>> didn’t test at full scale. We were able to quickly recover from this
>> bump and keep moving along with the workload with minimal impact. The
>> license server was able to keep up very nicely with this workload once
>> we increased the number of file descriptors.”
>>
>> Cycle also hit a speed bump related to volume and byte limits on
>> Amazon’s Elastic Block Store volumes. But the company is already
>> planning bigger and better things.
>>
>> “We already have our next use-case identified and will be turning up the
>> scale a bit more with the next run,” the company says. But ultimately,
>> “it’s not about core counts or terabytes of RAM or petabytes of data.
>> Rather, it’s about how we are helping to transform how science is
>> done.”
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
> 
>