My Vision of the Future of Beowulf

By Donald Becker

I had the opportunity to present my vision for the future of Beowulf at the Boston Linux World in February 2005. I learned several lessons, which I share in the sections below.

Here is a PDF of my presentation.

A Common Language for Understanding

For several years, technology companies have been promoting visions of aggregated computing power as the next step in the evolution of computers. Common standards and definitions for the various models of distributed computing are still in flux. The term "utility computing" is generally used to describe the availability of computing resources on demand. "Grid computing" is the collaborative use of geographically distributed computers, available over a network. Those computers are independently administered machines in separate administrative domains, which implies higher administrative overhead, both political and technical, to get the domains working together. The differing environments and implementations raise the likelihood of machines being subtly incompatible with each other.

Clusters, on the other hand, are independent machines set up to act as a single computational system. They offer users the ability to manage a dynamically growing set of nodes, update programs and configurations, and present the resulting collection as a single machine.

We, as a community, have the chance to shape the discussion by helping others clarify the distinctions among "utility computing" models, grids, and clusters.

Interest in the potential of Beowulf clusters within a broader community of Linux users

Desktop machines are more powerful than ever, yet the earlier in the conceptual process that technical users can assess the impact of a variety of forces or variables on their designs or hypotheses, the greater the potential cost savings. Supercomputers remain out of financial reach for most companies, yet certain complex analyses benefit from more computing power than the best PCs or UNIX workstations can provide.

The applications used by national labs are frequently custom applications. Many of these developers anticipated the trend towards distributed computing and restructured their applications to run on parallel processors. The same foresight has been shown by many of the leading developers of commercial HPC software and a variety of critical applications are available in parallelized versions.

Innovative teams, whether they are designing a new molecule or combustion chamber, producing an animated movie, or analyzing an oil reservoir, are continually challenged to be creative while operating within time-proven design cycles. They are asked to bring their products to market faster as the market pressures facing technology companies increase. To meet those demands, technical teams seek to bring more analytical power to bear earlier in the process. Emphasis varies from finding a faster, cheaper way to produce an existing product to creating something that does not yet exist.

A large class of problems can be readily solved on single-processor systems, including PCs. However, many of the increasingly important design problems require the solution of coupled (stress plus thermal plus fluids) systems. To solve these problems in reasonable time frames, parallelism is needed, particularly as problems become increasingly complex while processor clock speeds are no longer increasing. At the other end of the spectrum are problems with very long run times (days to weeks), which require supercomputers or supercomputer-class capabilities such as fault tolerance. In between lies the "sweet spot" for Linux HPC clusters.

Telling Our Collective Story

Linux HPC clusters are an example of innovation achieved through novel ways of thinking about proven technology. This combination allows both exceptional performance and superior reliability, though it escaped the attention of those who perhaps used a different definition of innovation.

We incubated the Beowulf cluster in scientific computing circles, where complex problems could only be tackled by individuals possessing domain knowledge, parallel programming expertise, and the inclination and know-how to build, debug, and manage their own clusters. Our Beowulf community emerged, able to leverage the open-source software model to meet individual needs. Considered subversive by the supercomputing establishment, our community was emboldened to start a movement to democratize supercomputing.

We've been willing to challenge the beliefs of the Beowulf pioneers. Not everyone had the time, inclination, and expertise to build their own clusters, write applications, and manage the systems. Commercial customers reasonably expect reliable hardware, supported applications, training, and documentation. As the publishers of Linux distributions such as Red Hat and SUSE recognized, a purely open-source model cannot sustain that level of support.

To further the innovations needed to bring Linux HPC clusters to workday problems, I founded Scyld Software in 1998. The goal of Scyld Software was, and continues to be, to harness the capabilities of the Beowulf approach for the complex analysis, modeling, and simulation tasks of technical workgroups within enterprises. The company also seeks to bring innovation to the widespread management of shared computing resources without requiring the deep system knowledge and time needed to build and operate the early clusters.

My Vision for the Future

There is a common belief that clusters are everywhere in the HPC world and will grow even more rapidly in the future. I want to apply the cluster model to a broader array of commercial problems.

As clusters become commonplace for HPC applications in industry, they will be accepted by IT groups in the enterprise. We must meet simultaneous demands for reliability, scalability, and innovation. We have an evolving computer ecosystem in which the newest advances arrive first on commodity hardware and Linux.

I want to appeal to the visionary in users, system administrators, and developers so we can catalyze their ideas and make their visions real. It's critical to make sure that individuals don't have to change the way they already work. Whether one is using a server to manage an HPC application, focus on high-performance throughput, run small-scale parallel jobs, improve data mining, or speed up dynamic web applications, I want to make using a cluster the obvious choice.
