Beowulf Questions

Robert G. Brown rgb at phy.duke.edu
Mon Jan 6 07:53:13 PST 2003


On Mon, 6 Jan 2003, Mark Hahn wrote:

> OK, so grid is just cycle scavenging with its own meta-queueing,
> its own meta-authentication and its own meta-accounting?

And perhaps most important (but not yet significantly implemented,
although there is a very serious project here at Duke to implement it,
called Computers On Demand or COD) -- a meta-OS-environment and
meta-sandbox for the distributed users that can be loaded literally on
demand (at a suitable time granularity, of course:-).
Multiuser/multitasking with a vengeance, where "the network is the
computer" on a very broad scale indeed.

Mark, you shouldn't discount the economics of cycle scavenging or refer
to it as "just" that.  In one sense, all multiuser/multitasking
computing is cycle scavenging, but who would deny its benefit?  Even
now, things like Scyld can be booted from e.g. floppy on a node, leaving
the node's hard disk and primary install intact.  Or, people can install
two or three or ten bootable images on a modern disk and choose between
them with grub.  Surely it isn't crazy to develop tools to take the
individual craft and handwork out of these one-of-a-kind solutions and
make them generally and reliably implementable?
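
Just to make that concrete, the "handwork" in question is exactly the
sort of thing a tool could generate.  A toy sketch in python -- the
image names, partitions and kernel paths below are made up, and real
tools (COD among them) would of course do far more than rewrite a boot
menu:

    # Hypothetical sketch: emit GRUB (legacy) menu.lst stanzas for a
    # set of bootable images on one node.  All names here are made up.
    IMAGES = [
        # (title, grub root partition, kernel path, kernel root= arg)
        ("Cluster node image", "(hd0,1)", "/boot/vmlinuz", "/dev/hda2"),
        ("Desktop install",    "(hd0,2)", "/boot/vmlinuz", "/dev/hda3"),
    ]

    def menu_lst(images, default=0, timeout=10):
        """Return the text of a menu.lst with one entry per image."""
        lines = [f"default {default}", f"timeout {timeout}", ""]
        for title, root, kernel, rootdev in images:
            lines += [
                f"title {title}",
                f"    root {root}",
                f"    kernel {kernel} ro root={rootdev}",
                "",
            ]
        return "\n".join(lines)

    if __name__ == "__main__":
        print(menu_lst(IMAGES))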

Just to give you a single example of the economics that drive this
process: over the last five years, Duke has gone from a couple of
clusters (mine and one over in CS) to literally more clusters than the
University per se can track -- from order 10-100 nodes total to
thousands of nodes in tens of departments in maybe 100 independent
groups.

Some groups (the embarrassingly parallel (EP) folks like myself) are
always and inevitably cycle-hungry.  They build all the nodes they can
afford and run on them continuously.  They don't need much in the way
of network.  They do need
a "known environment" on the nodes -- e.g. PVM or MPI or GSL or ATLAS or
ETC libraries, the right OS and release number to support their
binaries, appropriate permissions.  Other groups need more nodes than
they can ever afford to buy, but only for three short weeks a year.
When they need them, they REALLY need them, but the rest of the time
they are doing something else (e.g. analyzing the results of those three
weeks, thinking, writing, teaching).  Some groups need tightly coupled,
synchronous clusters.  Others are EP.
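
To put a bit of flesh on that "known environment" requirement, here is
the flavor of check a scheduling tool might run before handing a node
to one of these groups.  A purely hypothetical sketch -- the library
names and kernel release below are stand-ins, not anybody's actual
requirements:

    # Hypothetical environment check -- not part of any existing tool.
    import platform
    from ctypes.util import find_library

    REQUIRED_LIBS = ["gsl", "mpi"]   # e.g. GSL plus some MPI library
    REQUIRED_KERNEL_PREFIX = "2.4."  # e.g. the release binaries expect

    def node_is_suitable():
        """True if this node's kernel and shared libraries look right."""
        if not platform.release().startswith(REQUIRED_KERNEL_PREFIX):
            return False
        return all(find_library(lib) is not None for lib in REQUIRED_LIBS)

    if __name__ == "__main__":
        print("suitable" if node_is_suitable() else "unsuitable")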

The potential benefit of providing a suitable interface that permits
these various groups, with their widely disparate needs and usage
patterns, to mutually optimize their investment and usage across ALL
organization-level cluster resources (in our case initially
departments, eventually perhaps the University itself) is significant
-- equal to or greater than the total value of those resources,
presuming that the AVERAGE resource utilization is likely to be below
50% as things currently stand (a not unreasonable number, BTW).
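
To make the arithmetic behind that claim concrete (with made-up
numbers, mind you -- nothing below is a measurement of anybody's actual
clusters): if standing utilization averages 45% and institution-wide
sharing pushes it to 90%, the work recovered is what you would get by
buying a whole second fleet and running it the old way.

    # Back-of-the-envelope sketch of the utilization argument.  All of
    # the numbers are assumptions chosen purely for illustration.
    nodes = 1000               # assumed total nodes across all groups
    avg_utilization = 0.45     # assumed standing average (below 50%)
    shared_utilization = 0.90  # assumed with institution-wide scheduling

    standing_work = nodes * avg_utilization   # delivered node-equivalents
    shared_work = nodes * shared_utilization
    recovered = shared_work - standing_work

    print(f"work recovered by sharing: {recovered:.0f} node-equivalents")
    print(f"same as buying {recovered / avg_utilization:.0f} more nodes "
          "and running them the old way")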

This may not be Grid Computing (big G, on the RC5/SETI scale), but is
rather grid computing (small g, on a purely sensible institutional
scale) within a single organization with a single domain of trust, an
adequate backbone and other infrastructure, and a unified model and
toolset that permit its resources to be utilized optimally at the
institutional level rather than at the individual research group level.

Here it makes sense, I think, although time and experience will prove
this right or wrong.  At any rate, there is a clear economic benefit
that drives the development process -- it remains to be seen whether or
not it can be realized.  The tools being developed will really
revolutionize "node" resource allocation, by the way.  In principle they
will allow the automated recovery of cycles wasted at night on e.g.
computer systems that support the undergraduate physics labs here -- by
day they run NT and are in use by students.  By night they sit totally
idle, consuming electricity that the University pays for to no visible
profit, while my work limps along on all the systems I can afford,
though it would greatly benefit from more.  On other scales the toolset
might control the allocation of a department-wide compute cluster among
four or five groups of researchers at an access granularity of hours to
days, or reallocate the undergraduate clusters provided as network
"terminals" to EP tasks during holidays and breaks.

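Just to illustrate the flavor of the policy involved -- a hypothetical
sketch, where the hours and the day/night rule are assumptions rather
than the actual COD interface, and holidays and breaks would obviously
need a real calendar on top of this:

    # Hypothetical policy sketch -- not the actual COD toolset.  Decide
    # whether a teaching-lab machine may be handed to batch EP work.
    from datetime import datetime, time

    LAB_START = time(8, 0)   # assumed: students use the lab 8am-6pm
    LAB_END = time(18, 0)    # on weekdays; breaks need a calendar too

    def reclaimable(now=None):
        """True outside teaching hours (nights and weekends, here)."""
        now = now or datetime.now()
        weekend = now.weekday() >= 5
        in_lab_hours = LAB_START <= now.time() < LAB_END
        return weekend or not in_lab_hours

    if __name__ == "__main__":
        if reclaimable():
            print("boot the cluster-node image and accept EP jobs")
        else:
            print("leave NT running for the students")
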
I expect/hope that over the next two or three years, GPL tools will be
developed and perfected (some of them here by Justin Moore and Jeff
Chase) that permit a midsized or larger organization to increase its
utilization efficiency for a wide range of compute cluster/node
resources by a factor of 2-3, with no significant degradation of
security and with an overall INCREASE in the productivity of just about
everybody associated with a shared group.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email: rgb at phy.duke.edu





