question about the operation of beowulf

Fri Dec 8 09:28:30 PST 2000

On Fri, 8 Dec 2000, Geneiatakis Dimitrios wrote:

> If i have to do a task in a beowulf how the computers communicate
> to do this work ?

Generally speaking, they use some sort of socket connection.  The
subtask on node 1 uses a socket to send messages/data to node 2, for
example.  I use the word "socket" here very generically, as it may or
may not be a typical TCP or UDP socket.

Having written raw socket code for a few applications at this point I
can safely and authoritatively say that while it isn't terribly
difficult if one has good books and documentation and a few examples to
work from, it is undeniably a pain in the butt to write your own raw
socket code for a parallel application.  There are many things to figure
out and many "wheels" to reinvent.  It is also very easy to make
something that almost works but breaks, leaving you with a fairly
significant debugging problem.  Finally, there are scaling issues with
raw socket code -- it isn't too difficult to construct a pair of
sockets, but a matrix of pairs of sockets between all pairs of nodes
that scales transparently with the number of nodes and provides one with
adequate information and controls over the node state is a horse of a
different color (and a LOT more work).

For this reason, most folks that write beowulfish parallel applications
use a library encapsulation of some sort of raw sockets that provide a
relatively high end interface to a network of interprocessor
communications (IPC) links.  The two most common ones are PVM and MPI.
PVM stands for "parallel virtual machine" and was originally written
strictly to wrap raw IP sockets and provide a daemon-level interface
between parallel program components, which could be running on literally
any old heterogeneous combination of hosts.  It was written (IIRC) at
Oak Ridge by interested volunteers and provided via "netlib" -- which
might be viewed as an early realization of the web that didn't quite
catch on.

MPI, on the other hand, was born out of user's frustration with the
programming interfaces of early parallel/SMP supercomputers.  In the
relatively early days of parallel supercomputers, every vendor had their
own proprietary communications layer and parallel API, and the
supercomputers themselves obsoleted every 2-4 years (or their users
moved to a different site where a different API reigned).  A lot of
users found themselves investing two years parallelizing their code for
a given API only to have to turn around and do it all over for a
different hardware API.  Parallel code (which is still very "expensive"
to develop) was just not portable.

The government and key industries finally decided enough was enough and
insisted that although supercomputer vendors could do what they liked
with their proprietary interfaces, they were only going to buy systems
with a common API that permitted at least moderate movement of code from
system to system.  Faced with this, the key supercomputer vendors
(possibly regretfully, as I'm sure each hoped that their own proprietary
API would somehow get branded and trump that of its competitors and give
them a corporate/government monopoly stabilized by the cost of porting
code, not unlike the stabilization of the Microsoft monopoly today)
formed a consortium and wrote a common "message passing interface" to
serve as a common wrapper for their proprietary communications systems
calls.

Eventually open source versions of MPI appeared and a network interface
was added that allowed it, too, to form the library foundation for IPC's
over a public IP network.  MPI and PVM thus became functionally
equivalent (for the most part, at least) libraries with very different
roots and with somewhat different superstructure support for
network-distributed parallel computations.

To my experience, folks coming from "big iron" supercomputers to
beowulfery tend to favor MPI as it facilitates the movement of their
code and is what they are used to; folks coming from distributed
parallel computing in network environments probably started with PVM and
tend to prefer that.  Were I to endorse one (or even acknowledge the one
I started with and hence prefer:-) I would stimulate a religious jihad
by the faithful of the other -- which may happen anyway.  You should
likely look them both over carefully to see which one you are likely to
find codophilosophically compatible before choosing one or the other.

A few other high end tools exist to permit certain classes of "IPCs" to
exist between collective parallel processes, e.g. MOSIX, but these tools
provide even more insulation between the programmer and the parallel
communications process, wrapping ordinary file and socket I/O calls in
kernel-based network layer insertions.  Or, of course, one can create a
file based I/O layer on top of e.g. NFS or any other synchronous shared
file system.  These tend to be relatively inefficient and slow, but for
some purposes this doesn't matter and I've certainly used NFS (with rsh,
rcp, and expect) as an "IPC" mechanism for embarrassingly parallel
master/slave task management and communications for very large clusters
on different networks where PVM might have been rude.  Because of
resource limitations in e.g. the number of open socket connections in
the kernel, sometimes this sort of trick can come in handy for other
reasons as well.

Let's see, to finish up:

  PVM and MPI are both available in most of the major linux
distributions as well as from their home sites.  Use a search engine or
visit www.beowulf.org, the beowulf underground (linked to the main
beowulf page) or nearly any other major beowulf's website e.g.
www.phy.duke.edu/brahma to find links to PVM and MPI sites.

Hope this helps:

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu