Beowulf: A theoretical approach

Robert G. Brown rgb at phy.duke.edu
Thu Jun 22 09:20:19 PDT 2000


On Thu, 22 Jun 2000, Nacho Ruiz wrote:

> Hi,
> 
> I'm doing my final year project and I'm writing about the Beowulf project
> and the Beowulf clusters.
> I've been reading several documents about Beowulf clusters, but I would
> like to ask all of you some questions about them.
> 
> As I've seen, the main objective behind any Beowulf cluster is the
> price/performance tag, especially when compared to supercomputers. But as
> network hardware and commodity systems are becoming faster and faster
> (getting closer to GHz and Gigabit speeds), could you think of competing
> directly with supercomputers?

They already do, in many arenas.  Greg Lindahl has given some wonderful
talks in which he showed alpha/myrinet beowulfs that outperformed a
number of the big iron systems.  His company (and some other turnkey
beowulf companies) are winning big contracts because their systems
aren't just cheaper; in many cases they are faster AND cheaper.

> As I see it the Beowulf cluster idea could be based on distributed
> computing and parallel computing: you put in more CPUs to get more
> speedup, but as you can't have all the CPUs in the same machine you use
> several. So the Beowulf cluster could fit in between distributed computing
> and the supercomputers (vector computers, parallel computers, etc.). You
> have advantages from both sides: parallel programming and high
> scalability; but you also have several drawbacks: mainly interconnection
> problems. Do you think that with 10 Gb connections (OC-192 bandwidth),
> SMP on a chip (Power 4) and massive primary and secondary memory devices
> at low cost, you could have a chance to beat most of the traditional
> supercomputers? Or is that not your "goal"?

You should skim over some of the talks and documents on
http://www.phy.duke.edu/brahma (near the top).  In particular, read the
sections that describe the scaling of parallel computer performance
(Amdahl's Law and various improved estimates thereof).  Then meditate
upon the fact that many of the big iron supercomputers have what amounts
to a beowulf architecture, even if they are SMP in that all the
processors reside in a single box (that is, they still use a de facto
"network" to communicate).  It might help to read Pfister's book "In
search of clusters", especially his discussion of (CC-)NUMA to see the
primary alternative, where shared memory is used to communicate between
processors/tasks.  It would be worthwhile for you to check out the
Trapeze project as well, which seeks to extend virtual memory over a
"beowulf-style" network architecture, exploiting the fact that a network
connection, however slow, is still three orders of magnitude faster than
disk access (so swapping or paging even to an NFS-mounted RAM disk on a
second system might well be considerably faster than swapping or paging
to a real disk in the same cabinet -- emphasis because NFS is not a
particularly fast or efficient protocol).
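
To make the scaling concrete, here is a minimal C sketch of Amdahl's
Law -- my own illustration, not anything from the brahma pages, and the
parallel fraction p below is an assumed value:

    /* amdahl.c: Amdahl's Law speedup S(N) = 1/((1-p) + p/N) for a
     * program in which a fraction p of the work parallelizes
     * perfectly across N CPUs. */
    #include <stdio.h>

    int main(void)
    {
        double p = 0.95;   /* assumed parallelizable fraction */
        int n;

        for (n = 1; n <= 1024; n *= 2) {
            double s = 1.0 / ((1.0 - p) + p / (double) n);
            printf("N = %4d  speedup = %6.2f\n", n, s);
        }
        /* Note the ceiling: as N -> infinity, S -> 1/(1-p) = 20. */
        return 0;
    }

Run it and watch the speedup saturate long before N gets large; the
serial fraction, not the network, sets the ultimate limit.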

Finally, recognize that there is no "goal" shared by all the people on
this list other than getting our work done.  We are process oriented,
not goal oriented;-).  For nearly everybody on the list, beowulfs
ALREADY "beat" traditional supercomputers in one or more critical
dimensions of the highly multidimensional cost/benefit decision all
humans with work to do face when trying to decide how to go about doing
it.  If my goal is to surf the web, I don't buy a Cray, I buy a PC.  If
my goal is to complete as many independent Monte Carlo simulations as
possible per unit time per dollar, I ALSO don't buy a Cray, I buy a
network of PC's organized into a "beowulf".  If my goal is to solve CFD
problems, weather problems, cosmology problems (closer to the "grand
challenge" level) then maybe I buy a Cray (or whatever) or maybe I build
or buy a beowulf of advanced and competitive design (which is likely to
cost a lot more than a simple network of PC's, but a lot less than the
Cray). Even the required time to problem completion matters -- if I
>>must<< solve the problem as rapidly as possible whatever the cost I'll
probably pick a different system than I'd pick if my goal is to solve
the problem as rapidly as possible given a fixed budget of X.
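
For the Monte Carlo case, here is a minimal MPI sketch of the
embarrassingly parallel pattern -- each node grinds away on its own
independent simulation and only the final tallies are combined, so the
speedup is nearly linear in the number of nodes.  It is an illustration
only; the trial count and the crude per-rank seeding are assumptions
made for brevity:

    /* mc_farm.c: each rank runs independent Monte Carlo trials
     * (estimating pi by dart-throwing) and only the final tallies
     * are combined.  Build with e.g.: mpicc mc_farm.c -o mc_farm */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long i, trials = 1000000, hits = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(rank + 1);   /* crude independent streams, for brevity */
        for (i = 0; i < trials; i++) {
            double x = rand() / (double) RAND_MAX;
            double y = rand() / (double) RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
        /* The only communication in the whole run: one reduction. */
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %f\n",
                   4.0 * total / ((double) trials * size));
        MPI_Finalize();
        return 0;
    }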

Driven by a mix of the overwhelming cost-benefit advantage of
beowulf/cluster supercomputing for MANY problems and the fact that
playing with this in the open source world is damn good and interesting
and useful computer science (Bell prizes have been awarded for
contributions in beowulfery), beowulfs have evolved into the
supercomputer architecture of choice for thousands of sites, many of
them "tiny" by comparison with the big sites.  I have a five CPU beowulf
in my >>home<< office (might get it up to eight this year, with luck and
another $1.5K or so invested).  A joke by the standards of a T3x or SPx,
but even so, I get excellent performance on some problems I'm interested
in AND it gives me a very convenient laboratory for my amateurish forays
into computer science.  Lots of physics, or chemistry, or computer
science, or engineering departments in universities have built small
16-32 node beowulfs.  

This needs to be compared to the far more serious laboratories run by
Don Becker, Erik Hendriks, et al. at Scyld.com, Greg Lindahl's systems,
the systems run by e.g. Walter Ligon and Rob Ross at Clemson, the
thousand node system doing genetic code development (at Stanford?) and
more, where they focus on the underlying computer science and
development of advanced infrastructure and software support.  Between
the high road (interesting computer science that might enable
interesting problems to be tackled one day) and the low road (USEFUL
results from the interesting computer science being applied to solve
real world problems right now) beowulfery has a "mind of its own" and a
unique pattern of evolution.

Really, the way the beowulf "movement" has proceeded is a fascinating
concept and well worth writing about, but don't start off by ascribing
to it a fixed goal.  It is an amalgam, a hodgepodge; it is like Damascus
steel, with soft grains of iron intermixed with hard grains of high
carbon cementite producing something amazingly tough >>and<< flexible.
Its evolution pattern involves the open exchange of ideas, the
hard-nosed acceptance or rejection of those ideas on the basis of the
economic benefit that derives from them, and the alteration of bad ideas
into good ones and good ones into better ones as clever ideas are
battered around between smart people.

Ahhh, but I digress.  Or is it regress?  In a minute I'll be spouting
poetry, "Ode to the Beowulf...";-)

> And about the evolution of the Beowulf clusters: do you all follow some
> kind of guidelines, or has the project divided into several flavors and
> objectives?

As I said, there isn't really a "project".  There isn't even a
"consortium" or "IEEE standards committee".  There is only a website
(really a LOT of websites) and an informal, unmoderated list that anyone
can join (and leave again if it doesn't suit them).  Well, it is
moderated by common consensus and occasional flames -- anybody who gets
egregiously off-topic or out of line gets razzed or roasted or both.  Or
worse, just ignored.

There are lots of places (on the list and on websites) where successful
solutions are posted and documented, and more are being developed every
day.  People participate because they want to, because they have a
>>use<< for the idea.  The process is stabilized by those folks that
have participated for a long time and by now are invested in
beowulf-oriented development as part of their research, their business
plan, their professional focus, or just their hobby.  There are a number
of folks on the list who have been doing one sort of distributed
parallel computing or another for a LONG time (which in this business is
anything over five years:-) including some of the original inventors of
the term "beowulf".

Two flavors have emerged on the list that are worth mentioning.  The
first is the "true beowulf" flavor, which is by definition a collection
of COTS computers interconnected by a COTS (and usually private) network
with (usually) a single "head".  The idea is that a "beowulf" is a
supercomputer assembled out of COTS parts, and so it can be given a
single "name", usually the hostname of its head node; the internal
nodes are viewed as "parts of the supercomputer" and are not generally
accessed or utilized as separate compute entities.

The second: many (possibly even most, I don't know) of the folks on the
list are interested in or use more general "clusters" of computers --
things that have been dubbed NOWs, COWs, POPs -- or hybrids, where a
cluster architected much like a beowulf and used for relatively fine
grained synchronous tasks is part of, and node-accessible from, a larger
network/cluster that might be viewed as a NOW/COW/POP or just a plain
old LAN where coarse grained or embarrassingly parallel tasks can be
farmed out and harvested.

Both groups have similar problems to solve and share a common language
of IPC's, latencies, bandwidths, communication patterns, and so forth.
Tools developed for use in one context can often be used in the other.
I'd say that the computer scientists tend to focus on the "true beowulf"
side of things (as that is where the more interesting and tougher
problems are) and the stuff they develop filters down into the more
general cluster world as appropriate to a task.

These two groups generally coexist amicably on the list, provided that
one doesn't try to call a sloppy old compute cluster (or anything
running WinNT or Win2k) a "beowulf";-).

> Are the objectives of the beginning the same as today, or do you now plan
> to have something like a "super SMP computer" in a distributed way (with
> good communication times)? I've seen that a lot of you are focusing on the
> GPID and whole machine idea; do you think that is reachable? What are the
> main objectives vs the MPI/PVM message passing idea?
> And what about shared memory (at the HD level or the RAM level)? Do you
> take advantage of having this amount of resources?

I could write a book to answer all these questions (the ones that in
fact can be answered).  Hopefully the discussion above indicates that
many of them don't have answers (or have obvious/silly answers -- of
course the objectives change with time as the evolution of hardware and
software opens up new possibilities).  There are groups doing work on
most of the things you mention.  Many of those groups are not directly
engaged with "beowulfery" and may not even have a member on the list,
but that doesn't stop the work they do from percolating in, as long as
it is "open".  "Open" is as much a part of the definition of the beowulf
as "COTS".

> Is this idea trying to reach the objective of making parallel programs
> "independent" of the programmer? I mean, that instead of having to program
> keeping in mind that you are using a parallel machine, you can program in
> a "normal" way and the compiler will divide/distribute the code over the
> cluster. Is this reachable or just a dream? Is somebody working on this?

I personally don't even think that it is a dream at this point -- it is
a fantasy. The space of possible program parallelizations and
reorganizations is a >>very, very complex one<<.  Sure, one can write
parallel libraries that will autoparallelize some simple "atomic"
operations (like multiplying two vectors or doing a sort).  However, to
>>optimally<< parallelize (or even execute as a single threaded task)
even something this "simple", one's compiler/library simply has to know
all sorts of things about the parallel computer on which they are to be
run.  How fast are the IPC's?  Where are the L1 and L2 cache boundaries?
How fast is memory?  What's the cost of a context switch?  This problem
is difficult enough on an SMP system, where many of the answers are
homogeneous and can in principle be made known to a parallel compiler --
in a beowulf, where there are no guarantees of homogeneity or even a
common standard of design, it is nearly "impossible".

Nevertheless, I wouldn't say that nobody is working on this.  The
problem is that they are working on the tools that will enable the tools
that will enable the tools to be built that MIGHT one day permit at
least the efficient and automatic parallelization of a small subset of
the standard operations one might wish to perform -- e.g. matrix
operations and sorts and the like.  Even then I'd guess that the
programmer will have to be aware that they are programming for parallel
operation as there are issues like data organization and separability
that I doubt any compiler can manage.

Consequently I think that the day will never come when one can take, for
example, the off-the-shelf source code to, say, netscape or ls or sort,
and compile it with the -parallel flag and produce a binary that will
automatically parallelize itself at all, let alone efficiently.  The
compiler will also be utterly unable to make the most important decision
of all -- that it is STUPID to parallelize netscape or ls, STUPID to run
sort in parallel for small problems, but that it MIGHT be smart to
parallelize sort for problems bigger than some system- and
network-speed-dependent threshold!
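
A back-of-the-envelope C sketch of that threshold follows; every
constant in it (per-element compute time, per-element communication
cost, message latency, CPU count) is an invented illustrative number,
not a measurement of any real system:

    /* threshold.c: crude model of when parallelizing pays off.
     *   T_serial(n)   = n * t_compute
     *   T_parallel(n) = n * t_compute / ncpus + t_latency + n * t_comm
     * Parallelize only when T_parallel < T_serial; the crossover n
     * depends entirely on the (system and network) constants below. */
    #include <stdio.h>

    int main(void)
    {
        double t_compute = 1.0e-6;  /* assumed secs of work per element */
        double t_comm    = 8.0e-7;  /* assumed secs to ship one element */
        double t_latency = 1.0e-4;  /* assumed per-message setup cost */
        int    ncpus     = 16;
        long   n;

        for (n = 100; n <= 10000000; n *= 10) {
            double t_ser = n * t_compute;
            double t_par = n * t_compute / ncpus + t_latency + n * t_comm;
            printf("n = %8ld  serial %.4g s  parallel %.4g s  -> %s\n",
                   n, t_ser, t_par,
                   t_par < t_ser ? "parallelize" : "stay serial");
        }
        return 0;
    }

With these particular (made up) numbers the crossover sits near n of a
thousand; change the network constants and it moves, which is exactly
why no compiler can make the decision without knowing the hardware.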

> And what about the administration of a cluster? Having all the machines of
> the cluster under control, so you know which are available to send work
> to, is a hazardous but necessary task. It's not as easy as in an SMP
> machine, where you know or assume that all the CPUs inside are working; in
> a cluster you can't do that, as the CPU might work but the HD, NIC or
> memory may fail. How much computational time do you spend on this task? Is
> somebody working on a better way to manage this?
> 
> I know that some time ago HP had a machine with several faulty processors
> working and achieving high computational speeds without any error. They
> used some kind of "control algorithm" that manages to use only the good
> CPUs. Do you have something like this, or is there no point? Does it make
> sense?

There are plenty of people working on administrative tools, and a number
of tools already exist to solve many of the problems you mention
(although the tools may still need work).

Fault tolerance is a whole different question.  There are certainly
folks interested in the problem, and very likely people working on
it, but fault tolerance is >>also<< a cost-benefit problem and it has
two general KINDS of solutions.  One is to engineer the tolerance into
the underlying systems architecture.  Dual power supplies.  RAID 5 run
by dedicated controllers.  ECC memory.  This is VERY difficult to extend
to beowulf architectures and isn't easy even on SMP systems -- it is
hard to design a computer system that cannot be brought down in its
entirety by >>any<< failure of its parts, especially when one of the
parts in question is a CPU.

The second approach is to engineer the tolerance into the software using
the hardware you've got.  In many cases this involves checkpointing the
code, one way or another (whether the checkpoint goes to memory, to
disk, to another system is almost irrelevant).  The point is that code
checkpointing takes quite a lot of time regardless of the medium
compared to the work that would be done in that time, and that this time
reduces the efficiency of the program.  This remains true even if the
redundancy is engineered in at the systems level, although there, clever
engineering can sometimes hide the extra work or do IT in parallel.

One then has to examine the cost-benefit equation.  It costs you X
amount of time to checkpoint at some frequency.  During the intervals
you are at risk, with some probability.  If the probability of failure is
low, the intervals that lead to the smallest expected value for the time
to completion will be large, often "infinite" (you should just run the
damn program and not bother checkpointing, and risk having to run it
over again once every five thousand or so times, because checkpointing
it even once costs you one part in a thousand in performance).  In other
cases (for example, when you're going to run a program that takes a year
to complete on 1000 nodes at once) the interval may need to be
relatively short just to ensure that the problem EVER completes.
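
The trade-off can be sketched numerically.  The C fragment below uses
the standard first-order approximation for checkpoint overhead (the
checkpoint cost spread over the interval, plus the expected rework
after a failure); the checkpoint cost and mean time between failures
are invented for illustration:

    /* ckpt.c: expected overhead of checkpointing at interval tau.
     *   overhead(tau) ~ delta/tau + tau/(2 * mtbf)
     * which is minimized near tau = sqrt(2 * delta * mtbf) (Young's
     * approximation).  All numbers are illustrative assumptions.
     * Build with: cc ckpt.c -o ckpt -lm */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double delta = 60.0;               /* assumed secs per checkpoint */
        double mtbf  = 30.0 * 24 * 3600;   /* assumed mean secs to failure */
        double tau, tau_opt = sqrt(2.0 * delta * mtbf);

        for (tau = 600.0; tau <= 6.0e5; tau *= 10.0)
            printf("tau = %8.0f s  overhead = %7.4f%%\n", tau,
                   100.0 * (delta / tau + tau / (2.0 * mtbf)));
        printf("optimal tau is roughly %.0f s (overhead %.4f%%)\n",
               tau_opt, 100.0 * (delta / tau_opt + tau_opt / (2.0 * mtbf)));
        return 0;
    }

With a month between failures and a one-minute checkpoint the optimal
interval comes out hours long and the overhead is well under one
percent; shrink the MTBF to hours (say, a thousand flaky nodes) and the
picture changes completely.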

Confronted with this, for many folks on the list engineering
fault-tolerance into their programs is a total waste of time and a
cost-benefit loss.  One can write fault tolerant parallel software NOW
with existing tools, but it is a lot of work and will slow down your
job, possibly quite a bit.  Failover is usually more interesting to
folks who use clusters in different ways than are discussed on the list
-- for example corporate parallelized database servers or really big
(multisystem) webservers like yahoo.  In these cases, the "cost of
failure" (even one failure every umpty-ump days) may be so high as to be
"unacceptable", even compared to the relatively high costs of fault
tolerance.  These operations aren't really "beowulfs" although of course
they are "close" and their operators/designers would be welcome on the
list.

At a guess, this is the kind of problem that will -- eventually -- be at
least partly addressed by work being done at a number of places.  I
believe that there is at least one group working on certain core pieces
of software that will build beowulf support directly into the kernel,
where it can benefit from increases in speed and efficiency and where
one can BEGIN to think about issues like fault tolerance at a lower
level than program design.  This is the kind of thing the "true beowulf"
computer science groups think about.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
