[Beowulf] HPC in Windows

Tue Oct 12 08:35:05 PDT 2004

On Mon, 11 Oct 2004, Erik Paulson wrote:

> On Sat, Oct 09, 2004 at 06:11:01PM -0400, Robert G. Brown wrote:
> > On Sat, 9 Oct 2004, Rajiv wrote:
> > 
> > > Dear All,
> > >     Are there any Beowulf packages for windows?
> > 
> > Not that I know of.  In fact, the whole concept seems a bit oxymoronic,
> > as the definition of a beowulf is a cluster supercomputer running an
> > open source operating system.
> > 
> 
> It's really time that we gave up on trying to hold a strong definition
> to "beowulf". It's like kleenex or hacker/cracker. The world doesn't
> care. Clusters of x86 PCs doing "HPC" = beowulf

Now look what you did.  Used up my whole morning, just about.

The easily bored can skip the rant below.

<rant id="389">

This (what's a beowulf?) is list discussion #389, actually.  Or maybe it
is that the discussion has occurred 389 times, I can't remember.  I do
remember that the first time I participated in it was around seven or
eight years ago, that I advanced the point of view that you espouse here
-- and that I changed my mind.

The definition of beowulf as OPPOSED to "just" a cluster of systems
(nuttin' in the definition about them being "PC"s, just COTS systems)
was given by the members of the original beowulf project with explicit
reasons for each component.  Note well that cluster supercomputing was
at the time not new -- I'd been doing it myself by then for years (on
COTS systems, for that matter, if Unix workstations can be considered
off the shelf), and I was far, far from the first.

At that time, there were already NOWs, COWs, PoPs and more.  See
Pfister's "In Search of Clusters" for a lovely, balanced, and not
terribly beowulf-centric historical review.

Two things differentiated the beowulf from earlier cluster efforts.

  a) Custom software designed to present a view of the cluster as "a
supercomputer" in the same sense (precisely) that e.g. an SP2 or SP3 is
"a supercomputer" -- a single "head" that is identified as being "the
computer", specialized communications channels to augment the speed of
communications (then quite slow on 10 Mbps ethernet), stuff like bproc
designed to support the member computers being "processors" in a
multiprocessor machine rather than standalone computers.  Note that this
idea was NOT totally original to the beowulf project, as PVM already had
incorporated much of this vision years earlier.

  b) The fact that the beowulf utilized an open source operating system
and was built on top of open source software.  The reasons for this at
the time were manifest, and really haven't changed.  In order to realize
their design goals that >>extended<< the concepts already in place in
PVM, they had to write numerous kernel drivers (hard to do without the
kernel source) as well as a variety of support packages.  Don Becker
wrote (IIRC) something like -- would that be all of the linux kernel's
network drivers at the time or just 80% of them? -- hard to remember at
this point, but a grep on Becker in /usr/src/linux/drivers/net is STILL
pretty revealing.  Now look for Sterling and Becker's contributions to
the WinXX networking stack.  Hmmmm....
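
For anyone who wants to repeat that grep without a shell handy, here is
a minimal Python sketch of the same idea.  The helper name is made up
for illustration, and the directory is simply the path quoted above; it
assumes a kernel source tree is unpacked there, which may well not be
true on any given machine.

```python
import os

def files_mentioning(directory, name):
    """Return sorted names of regular files directly under `directory`
    whose contents contain `name`. Hypothetical helper, for illustration."""
    hits = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if not os.path.isfile(path):
            continue  # skip subdirectories; this sketch is not recursive
        # Old driver sources are not always valid UTF-8, so ignore decode errors.
        with open(path, encoding="utf-8", errors="ignore") as handle:
            if name in handle.read():
                hits.append(entry)
    return hits

if __name__ == "__main__":
    src = "/usr/src/linux/drivers/net"  # path from the text; assumed to exist
    if os.path.isdir(src):
        print(len(files_mentioning(src, "Becker")), "files mention Becker")
    else:
        print("no kernel source tree at", src)
```

What it reports depends entirely on which kernel version (if any) is
installed, so no particular count is implied here.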

The insistence on COTS hardware, actually, is what I'd consider the
"weakest" component of the original definition, as it is the one
component that was readily bent by the community in order to better
realize the design goal of a parallel supercomputer capable of running
fine grained parallel code competitively with "big iron" supercomputers.
The beowulf community readily embraced non-commodity networks when they
appeared. Note that I consider "commodity" as meaning multisourced with
real competition holding down prices and generally built on an "open"
standard, e.g. ethernet is open and has many vendors, myrinet is not
open and is available only from Myricom (although at all points there
has been at least some generic competition between high end
proprietary networks).

Myrinet historically was perhaps >>the<< key component that permitted
beowulves to reach and even exceed the performance of so-called big iron
supercomputers for precisely the kind of fine grained numerical problems
that the supercomputers had historically dominated.  I remember well
Greg Lindahl, for example, showing graphs of Alpha/Myrinet speedup
scaling compared to e.g. SP-series systems and others, with the beowulf
model actually winning (at less than 1/3 the price, even using the
relatively expensive hardware involved).

> And on the Beowulf on Windows bit - 
> http://www.amazon.com/exec/obidos/tg/detail/-/0262692759/qid=1097514164/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7091285-1915902?v=glance&s=books&n=507846
> 
> "Beowulf Cluster Computing with Windows (Scientific and Engineering Computation)
> by Thomas Sterling" - If Tom says that you can build a beowulf on 
> Windows, I think you can. 

I can only reply with:

  http://www.beowulf.org/community/column2.html

by Don Becker, in which he points out that when they first met, Sterling
was "obsessed with writing open source network drivers".  Or if you
prefer, Question Number One of the beowulf FAQ:

  1. What's a Beowulf?

  Beowulf Clusters are scalable performance clusters based on commodity
  hardware, on a private system network, with open source software
  (Linux) infrastructure.

  Each consists of a cluster of PCs or workstations dedicated to running
  high-performance computing tasks. The nodes in the cluster don't sit
  on people's desks; they are dedicated to running cluster jobs. It is
  usually connected to the outside world through only a single node.

  Some Linux clusters are built for reliability instead of speed. These
  are not Beowulfs.

Or check out my "snapshot" of the original beowulf website, preserved in
electronic amber (so to speak) from back when I ran a mirror:

 http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf/

The introduction and overview contains a number of lovely tidbits
concerning the beowulf design and how it differs from a NOW.  It makes
it pretty clear that the only way a pile of WinXX boxes could be "a
beowulf" (as opposed to a NOW) would be if Microsoft Made it So -- the
WinXX kernels and networking stack and job scheduling and management are
essentially inaccessible to developers in an open community, which is
why WinXX clusters like Cornell's (however well they work) stand alone,
supported only to the extent that MS or Cornell pay for it with little
community synergy.

Nobody would argue, of course, that one can't build a NOW based on WinXX
boxes.  A number exist.  WinXX boxes run PVM or MPI (and have been able
to for many years, probably even predating the beowulf project although
I'm too lazy to check the mod dates of the WinXX ifdefs in PVM).  One
can also obviously build a grid with WinXX boxes in it, probably more
easily than one can build a true parallel cluster.  Grid-style clusters
(a.k.a.  "compute farms") predate even virtual supercomputers in cluster
taxonomy, for all that they have a new name and a relatively new set of
high-level support software (just as the beowulf has, in the form of
bproc implemented in clustermatic and scyld).

Those of us who used to "roll our own" gridware to permit the use of
entire LANs of workstations on embarrassingly parallel problems find
this (toplevel support software) a welcome development, and it has
indeed blurred the lines between beowulfs and other NOWs to some degree,
but if anything it is DIMINISHING the identification of all clusters as
"beowulfs".  Look at all the Grid projects in the universe -- BioGRID,
the smallpox grid, ATLAS grid, PatriotGrid -- grids are proliferating
like crazy, but they aren't considered or referred to as beowulfs.  In
most cases "beowulf" isn't even mentioned in their toplevel
documentation.

One of the fundamental reasons for differentiation is this very list.
Few people who have been on the list for a long time and who have worked
with beowulfs and other kinds of open source clusters for a long time
have any particular interest in providing community support to cluster
computing under Windows.  For one thing, it is nearly impossible -- it
requires somebody with trans-MCSE knowledge of Windows' kernels,
libraries, drivers, networking stack, and tools including the various
WinXX ports of key cluster software where it exists.  For another,
people who work in that community who DO have that level of expertise
don't seem to want to share -- they want to sell.  One has to pay to
become an MCSE; one then expects a high rate of consultative return on
the investment.  One cannot easily obtain access to WinXX source code,
and open or not, access to kernel-level source code turns out to be
essential to getting maximal performance out of a true beowulf or even
advanced non-beowulf style cluster.  

Besides, nearly all the tools involved (beyond userspace stuff like PVM
or MPI in certain flavors) are SOLD and supported by Microsoft (only) or
other Microsoft-connected commercial developers and the only "benefit"
we get back in the community from providing support for them is to
increase their profits and to encourage them to turn around and resell
us our own developments and ideas at a high cost.  So let THEM provide
the consultation and expertise and "intellectual property" they prize so
highly; I will not contribute.

Contrast that with the really rather unbelievable level of support
freely offered via this list to (yes) general cluster computer users and
builders (not just "beowulf" builders by the strict definition).  This
support is predicated on the fundamental notions of open source software
-- that effort expended on it comes back to you amplified tenfold as the
COMMUNITY is strengthened in the open and free exchange of ideas.
Consider the many tools and products that support beowulfery (or
generalized cluster computer operation) that would simply be impossible
to develop in a closed source proprietary model.  People who participate
in this sort of development have no desire to do all the work to create
new tools and products only to have Microsoft and its software lackeys
do its usual job of co-opting the tool, branding it, shifting the core
standard from open to proprietary, and then squeezing out the original
inventors (extended rant available on request:-).

For all of these reasons, I think that it is worthwhile to maintain the
moderately strict definition of "a beowulf" as a particular isolated
network arrangement of COTS systems running open source software and
functioning as a cluster capable of running anything from fine grained
parallel problems down to distributed single tasks with a single "view"
of task ID space.  This is a fairly open and embracing definition --
people on the list run "beowulfs" with a single head, multiple heads,
many operating systems other than Linux (most of them open source --
WinXX users are subjected to fairly merciless teasing if nothing else
...hotter:-).  It is differentiated from (recently emerging) definitions of
Grid-style clusters, from my much older definition of a "distributed
parallel supercomputer" (built largely of dual use workstations that
function as desktop machines in a LAN while still permitting
long-running numerical tasks to be run in the background), from MUCH
older definitions of NOWs, COWs, Piles of PCs.

So, if somebody says they've "built a beowulf" out of a bunch of WinXX
boxes, yes, I know what they mean, even though what they say is almost
certainly not correct.  The list is fairly tolerant of pretty much
anybody doing any kind of cluster computing, even Windows based NOWs or
Grids.  "Extreme Linux" as a more general vehicle for linux cluster
development never quite took off, and www.extremelinux.org continues to
be a blank page as it has been for years now.  As I said above, I
personally don't even DO "real" beowulf computing and never have -- my
clusters tend to be NOWs, although we're gradually shifting more towards
a Grid model as improved software makes this the easy path support-wise.

As a final note, I personally view the original PVM team as the
"inventors of commodity cluster computing" even more than Sterling and
Becker (much as I revere their contributions).  If a "beowulf" is a
network of computers running e.g. PVM on top of proprietary software,
Dongarra et al. beat Sterling and Becker to the punch by years.  This
isn't a crazy idea -- PVM already contains "out of the box" many of the
design goals of the beowulf project -- a unified process id space
(tids), a single control head that supports the "virtual machine" model,
the ability to run on commodity hardware.  It just does it in userspace,
and hence has limits on what can be accomplished performance-wise, and
has the usual PVM vs MPI problems with the older supercomputer
programmers (who all used MPI, for interesting historical reasons).
(Interestingly, "old hands" in the beowulf/cluster business nearly all
tell me that they used to use and still prefer PVM, while MPI is still
the "commercially salable" parallel library that better favors the
traditional big iron supercomputing model;-)

To what PVM already provided, Sterling and Becker contributed the
notions of >>network isolation<< to achieve predictable network latency,
>>channel bonding<< of network channels, built on top of open source
network drivers, to improve network bandwidth (an accomplishment
somewhat overshadowed by the rapid development of faster networks and
low-latency networks), and eventually >>kernel-level modifications<<
that truly converted a cluster of PCs into a "single machine" the
components of which could no longer stand alone but were merely
"processors" in a massively parallel system with a single user-level
kernel interface.

So how in the world can Sterling argue that this >>beowulf<< software,
developed by the original beowulf team, is available for Windows?  Did I
miss something?

Network isolation, fine, that's a matter of trivial network arrangement
that anybody with $50 for an OTC router/firewall can now accomplish, but
channel bonded networks?  Unified process id spaces?  Kernel
modifications that make nodes into virtual processors in a single
"machine"?  Not that I know of, anyway, and obviously impossible without
fairly open access to Windows source code in any event.  At a guess, it
would require such a violent modification even to the more modern and
POSIX compliant WinXX's that the result could be called "Windows" only
in the sense that linux running a windowing system can be called
"Windows" -- pretty much a complete rewrite and de-integration of the
GUI from the OS kernel would be required (something that Microsoft has
argued in court is impossible, amusingly enough, as they have sought to
convince an ignorant public that Internet Explorer -- a userspace
program if ever there was one -- cannot be de-integrated from
Windows:-).

Asserting that there are truly Windows-based beowulfs does not make it
so, and coopting the term "beowulf" to apply to generic computing models
and tools that preceded the project by years is a kind of newspeak.
I'll have to just go on thinking of the idea as an oxymoronic one, at
least until Microsoft opens its source code or somebody succeeds in
rewriting history and the original definition and goals of the beowulf
project.

> ps - define "supercomputer" :)

AT THE TIME of the beowulf project, the definition was actually pretty
clear, if only by example.  I'd say it is still pretty clear, actually.

At that time (and still today, mostly) the generic term "computer"
embraced:

  a) Mainframes (the oldest example of "computer", still annoyingly
common in business, industry and academe).

  b) Minicomputers (e.g. PDP's, Vaxes, Harris's).  Basically
cheaper/smaller versions of mainframes that generally stood alone
although of course a number of them were used as the core servers for
Unix-based workstation LANs.

  c) Workstations (e.g. Suns, SGIs).  Typically desktop-sized computers
in a client-server arrangement on a LAN. Server-class Suns and SGIs were
sometimes refrigerator-sized units that were de facto minicomputers,
blurring the lines between b) and c) in the case where both were running
Unix flavors (or at least real multitasking operating systems).

  d) Personal computers.  A "personal" computer was always a desktop
sized unit, and the term "PC" generally applied to x86-family examples,
although clearly Apples were (and continue to be) PCs as well.  Note
that PCs were sometimes as capable, hardware-wise, as workstations and
had been networkable for years, so networking or hardware per se had
nothing to do with being a PC vs a workstation.  A PC really was
differentiated from being a workstation by a key feature of its
operating system -- the INability to login to the system remotely over a
network.  To use a PC, you had to sit at the PC's actual interface.
(Note that aftermarket tools like "PC anywhere" did not a PC a
workstation make).

  e) Supercomputers.  A supercomputer was (and continues to be) a
generic term for a "computer" capable of doing numerical (HPC)
computations much faster than the CURRENT GENERATION of a-d computers.
Obviously a moving target, given Moore's Law.  From the "first"
so-called supercomputer, the 12 MFLOP Cray-1, through to today's top 500
list, the differentiating feature is obviously RELATIVE performance, as
the Palm Tungsten C in my pocket (with its 400 MHz CPU) is faster than
the Cray 1.  

  f) Today there is a weak association between "supercomputer" and
"single task" HPC (so Grids and compute farms of various sorts are
somewhat excluded, probably BECAUSE of the top500 list and its
insistence on parallel linpack-y sorts of stuff as the relevant measure
of supercomputer performance).  So Grids have emerged as a kind of
cluster in their own right that isn't ordinarily viewed as a
supercomputer although a Grid is essentially unbounded from above in
terms of aggregate floating point capacity in a way that supercomputers
are not.  One could make a grid of all the top500 supercomputers, in
fact...

Note that historically supercomputers are differentiated from other a-d
class computers not by being "mainframe" or not, not by being vector
processor based vs interconnected parallel multiprocessor based, not by
its operating system, not even by its underlying computational paradigm
(e.g. shared memory vs message passing), certainly not by its ABSOLUTE
performance, but strictly by relative numerical performance.  My Palm a
decade ago would have been an export-restricted munition supercomputer,
usable by rogue nations to simulate nuclear blasts and build WMD.  Today
it is a casual tool used by businessmen to check the web and email and
remind them of appointments, while other munitions-quality systems are
now toys, used by my kids to race virtual motorcycles around
hyperrealistically rendered city streets.

Talk about swords into plowshares...;-)

The exact multiplier between "ordinary computer" performance and
supercomputer performance is of course not terribly sharp.  Over the
years, a factor of order ten has often sufficed.  In the original
beowulf project, aggregating 16 80486DX processors (at best a few
hundred aggregate MFLOPS; again, my Palm probably would beat it at a
walk) was enough.  Nowadays perhaps we are jaded, and only clusters
consisting of hundreds or thousands of CPUs, instead of tens, are in the
running.  Maybe only the top500 systems are "supercomputers".  Maybe the
term itself is really obsolete, as fewer and fewer systems that are
anything BUT a beowulf style cluster (even if it is assembled and sold
as a big iron "single system" with its internal cluster CPUs and IPC
network and memory model hidden by a custom designed operating system)
appear in the HPC marketplace.
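
The "relative, not absolute" definition can be written down as a
one-line test.  What follows is only a sketch: the function name is
invented, the default multiplier of ten is just the rough "factor of
order ten" mentioned above, and the FLOPS figures are made-up round
numbers rather than measurements of any real machine.

```python
def is_supercomputer(flops, ordinary_flops, factor=10.0):
    """True if a machine is at least `factor` times faster than an ordinary
    computer of the same era. The default of 10 is a rough convention taken
    from the text, not a sharp threshold."""
    if ordinary_flops <= 0:
        raise ValueError("ordinary_flops must be positive")
    return flops / ordinary_flops >= factor

# Made-up numbers, purely for illustration: one machine judged in two eras.
machine = 100e6        # a hypothetical 100 MFLOPS system
desktop_then = 5e6     # hypothetical desktop of the machine's own era
desktop_later = 1e9    # hypothetical desktop a decade or so later

print(is_supercomputer(machine, desktop_then))   # ratio is 20
print(is_supercomputer(machine, desktop_later))  # ratio is 0.1
```

The same hardware passes the test in one era and fails it in the next,
which is the whole point: the target moves with the baseline.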

Still, I think most people still know what "supercomputer" means.  In
fact, when one looks over the current top500, it appears that it has
>>almost<< become synonymous with the term "beowulf";-)

But not (note well!) with the term "grid", as grids aren't architected
to excel at linpack, and a grid is very definitely not a beowulf.

As far as I can tell, just about 100% of the top500 are clusters (COTS
or otherwise) architected along the lines laid out by the beowulf
project, with 95% of them having lots of scalar processors and the
remaining 5% having lots of vector processors.  Unfortunately, the
top500 (which I continue to think of as being almost totally useless for
anything but advertising) doesn't present us with a clear picture of the
operating systems or systems software architectures in place on most of
the clusters.  In fact, it provides remarkably little useful information
except the name of the cluster manufacturer/integrator/reseller (imagine
that;-).  Two clusters on the list (#146 at Cornell and #233 in Korea)
are explicitly indicated as running Windows.  Looking over the general
cluster hardware architectures and manufacturer/integrator/resellers, I
would guess that linux is overwhelmingly dominant, followed by freebsd
and other (proprietary) flavors of Unix, with WinXX quite possibly dead
last.

Open source development is an evolutionary model, capable of paradigm
shifts, far jumps in parametric space, and N^3 advantage in searching
high dimensional spaces.  Proprietary software development is by its
nature a gradient search process, prone to optimizing in perpetuity
around a slowly evolving local minimum, making long jumps only when it
steals fully developed memetic patterns (such as the Internet, cluster
computing, and many more) more often than not produced by evolutionary
communities.  To be fair, new patterns are sometimes introduced a priori
by brilliant individuals without clear roots in open communities (e.g.
"Turbo" compilers), although that is less common in recent years as the
open source development process has itself evolved.  The individuals
only RARELY work for major corporations any more, and the corporations
that are famous as idea factories -- e.g. Bell Labs -- created internal
"open" communities of their very own where the new ideas were incubated
and exchanged and kicked around.
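
The optimization metaphor can be made concrete with a toy.  The sketch
below is only an illustration of the two search styles on an invented
one-dimensional cost function; every name and parameter in it is an
assumption chosen for the example, it says nothing about software
development as such, and it does not attempt to demonstrate the N^3
figure.  A gradient search started in the shallow well stays there,
while a small population with occasional far jumps can land in the
deeper one.

```python
import random

def f(x):
    # Toy double-well cost: a local minimum near x = 1.13 and a deeper
    # minimum near x = -1.30. Invented purely for illustration.
    return x**4 - 3 * x**2 + x

def gradient_search(x, rate=0.01, steps=1000):
    """Plain gradient descent; it settles into whichever basin it starts in."""
    for _ in range(steps):
        x -= rate * (4 * x**3 - 6 * x + 1)  # derivative of f
    return x

def evolutionary_search(rng, size=20, generations=50, lo=-3.0, hi=3.0):
    """Keep the better half each generation and refill with offspring:
    mostly small mutations, with one in five taking a far jump."""
    population = [rng.uniform(lo, hi) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=f)
        survivors = population[: size // 2]
        children = []
        for parent in survivors:
            if rng.random() < 0.2:
                children.append(rng.uniform(lo, hi))  # far jump
            else:
                child = parent + rng.gauss(0.0, 0.1)  # small local mutation
                children.append(min(hi, max(lo, child)))
        population = survivors + children
    return min(population, key=f)

local = gradient_search(2.0)                   # starts in the shallow well
best = evolutionary_search(random.Random(0))   # fixed seed for repeatability
print(f"gradient search: x = {local:.2f}, f = {f(local):.2f}")
print(f"evolutionary:    x = {best:.2f}, f = {f(best):.2f}")
```

The far-jump probability and mutation width are arbitrary; with the
jumps turned off and the population seeded only in the shallow well,
the second search would get stuck just like the first.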

It's just a matter of mathematics, you see.

   Linux = mammal (sorry, Tux:-)

Evolving at a stupendous speed (compare everything from kernel to
complete distributions over the last decade)

   WinXX = Great White Shark

Evolutionarily frozen, remarkably efficient at what it does, immensely yet
curiously vulnerable...

</rant>

Well, that's enough rant for the day.  I've GOT to get some actual work
done...

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu