[Beowulf] Difference between a Grid and a Cluster???

Thu Sep 22 08:04:45 PDT 2005

Joe Landman writes:

> In a nutshell, a grid defines a virtualized cloud of processing/data 
> motion across one or more domains of control and 
> authentication/authorization, while a cluster provides a virtualized 
> cloud of processing/data motion across a single domain of control and 
> authentication/authorization.  Clusters are often more tightly coupled 
> via low latency network or high performance fabrics than grids.  Then 
> there is the relative hype and the marketing/branding ...

Agreed.

>> To be really fair, one should note that tools have existed to manage
>> moderate cluster heterogeneity for single applications since the very
>> earliest days of PVM.  The very first presentation I ever saw on PVM in
>> 1992 showed slides of computations parallelized over a cluster that
>> included a Cray, a pile of DEC workstations, and a pile of Sun
>> workstations.  PVM's aimk and arch-specific binary path layout was
> 
> aimk is IMO evil.  Not PVM's in particular, but aimk in general.  The 
> one I like to point to is Grid Engine.  It is very hard to adapt to new 
> environments.

<warning>
  patented rgb ramble follows, hit d now or waste your day...:-)
</warning>

Oh, I agree, I agree.  aimk and I got to be very good -- "friends", I
suppose -- some years ago.  It was a godawful hack, and it caused me
significant disturbance to learn that it was the basis of SGE even
today.  But of course, you also have to look at the magnitude of the
task it was trying to accomplish -- its complexity was just a mirror of the
incredible mix of Unices and hardware architectures available at the
time.  You had your bsd-likes (SunOS), your sysv-likes (Irix), your
let's-make-up-our-own-likes (NeXT OS anyone:-), your
unix-doesn't-have-enough-management-commands-likes (AIX of Evil).  These
changed everything from basic paths to /etc layout to the way flags
worked on common commands to using nonstandard commands and nonstandard
tools to do nearly ANYTHING.  aimk used rampant screen scraping just to
identify the architecture it was running on, since there wasn't
anything like a standard tool for doing so.  And it was just as bad for
the users -- who would e.g. try using ps aux on a sysv box when they
should've been using ps -ef.

This problem hasn't really gone away, it's just one that we don't
usually face any more.  Once SunOS was "the" standard unix, such as it
was, by virtue of selling more workstation seats than everybody else put
together and by being the basic platform upon which open source software
was built (it would nearly always build under SunOS if it built on
anything, and maybe it would build under Irix, or CrayOS, or AIX -- but
no guarantees).  Now Linux is "the" standard unix, by virtue of "selling"
more server AND workstation seats than everybody else put together AND
by being the basic platform upon which "most" open source software is
built, with its primary friendly open source competitor (FreeBSD)
maintaining a fairly deliberate compatibility level to make it easy to
port the vast open source library back and forth between them.

In addition, the emergence and moderate degree of standardization of the
lsb_release command and the uname -a command have made it possible to
determine distribution, release, kernel, hardware architecture, and base
architecture without resorting to wildass looking for a file in some
particular path that is only to be found there in version 7.1 of
widgetronic's Eunix, if it is installed on a Monday.  It's a moderate
shame that they didn't just name it the release command so that the
BSDs, solaris, and others can play too, but you can't have everything.
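
As a minimal sketch of that kind of identification (written in Python
purely for illustration; the function name describe_host and the "lsb_"
key prefix are assumptions of this example, not part of any standard
tool), one can gather the uname fields and, where the command happens to
be installed, the lsb_release output:

```python
import platform
import shutil
import subprocess


def describe_host():
    """Collect basic platform facts, roughly what `uname -a` reports."""
    u = platform.uname()
    info = {
        "system": u.system,    # e.g. "Linux"
        "release": u.release,  # kernel release string
        "machine": u.machine,  # hardware architecture, e.g. "x86_64"
    }
    # lsb_release is optional; only call it if it is actually installed.
    if shutil.which("lsb_release"):
        result = subprocess.run(
            ["lsb_release", "-a"], capture_output=True, text=True, check=False
        )
        # Output lines look like "Key:<tab>value"; keep whatever parses.
        for line in result.stdout.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                name = "lsb_" + key.strip().lower().replace(" ", "_")
                info[name] = value.strip()
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_host().items()):
        print(f"{key}: {value}")
```

On a system without lsb_release (a BSD, say) the sketch simply returns
the uname fields alone, which is the portability gap described above.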

The modern solution to this problem at the build level in the Open
Source world is to use GNU's autoconf stuff.  This operates at a level
of complexity that puts aimk to shame, and hides most of that complexity
from the user as long as it works, which it does for precisely those
architectures (e.g. linux) that are all homogeneously built anyway and
don't generally much need it.  If nothing else, it does provide one with
a sound way to build a Makefile or package that should "work" (build,
install, function) as far as e.g.  include paths and so on across a
wider range of architectures than would likely otherwise be the case.
Sometimes without actually having to instrument the code with all sorts
of #ifdefs.

MY own solution to all of this is to just say screw it.  I write
software that does not use autoconf, for the most part -- numerical code
just isn't that complex.  GUI code, sure.  Code that uses a lot of
high end libraries, maybe.  But numerical code that links -lm and the
GSL and perhaps a network library or pvm, why bother?  I create rpms
that will (re)build cleanly and easily on all the linux platforms I have
access to.  If anybody wants to use the (GPL) tools elsewhere, somewhere
that is really incompatible with this (solaris, Irix, AIX, WinXX), or
even on a debian system -- well, that's why the GPL is there.  They can,
they just have to do any repackaging or porting required themselves.  I
may "help" them, I may even swallow any #ifdef'd patches back into the
program if they don't bother the linux build, but I reject BOTH aimk AND
Makefile.am, Makefile.in, etc.  I like to have just one Makefile, human
readable and human editable, instead of e.g. Makefile.irix,
Makefile.linux, Makefile.aix or Makefile.am and Makefile.in where
much of the Makefile you use is completely beyond your ability to
control.

So I'm atavistic and curmudgeonly.  I also think that while diversity is
important for evolution to occur, deliberate diversity of proprietary
source material that seeks to "lock in" users to a nonstandard
environment where the nonstandardness is all in >>places that do not
matter<< tends to SUPPRESS evolution, by eliminating the sharing of
memes.  Microsoft is famous for this, but go down the list -- Sun, IBM,
DEC, Cray, SGI, they ALL do this.  By investing energy to make a tool
portable across their deliberate code-breaking differentiation, you only
encourage them and permit them to continue a crepuscular existence where
they can present to their waning body of users the illusion that they
too can provide the full range of open source packages that are out
there and are hence "as good as" (or even better than) plain old linux
(any flavor) or freebsd.  

They are incorrect, of course -- the number of packages, and level of
sophistication of the packages, available in e.g.  FC4 + extras + livna
+... is staggering.  Even commercial software vendors, if they ever come
to understand how, could use tools like yum to COMPLETELY AUTOMATE the
distribution and updating of their properly packaged software in a
totally reliable way to the licensed desktop or cluster node -- it
merely requires the right combination of handshaking, encryptions, and
keys.  So the sooner open source package maintainers stop expending
significant resources to ensure buildability across the commercial
unices (except where the commercial vendors directly support the port
with real money), the sooner the commercial incompatible unices will
just go away.  Evolution involves BOTH sharing of genes AND the
elimination of the less fit.  

Right now linux and freebsd are enjoying the fruits of robust
gene-sharing and an INTERNAL culling of the crap -- applications are
constantly being deprecated and dying away, with the ultimately democratic
process used to determine what lives -- ANYTHING can live as long as
somebody loves it and will invest the time required to maintain it, but
the last person out please turn out the lights and close the door.
Nothing is wasted -- the code base persists and can be looted for new
projects.  There is an interesting sort of "competition" between linux
distributions themselves -- made interesting because they all share well
over 90% of their code base, so it is a competition among siblings
pursuing a lover (the "user"), not a war between two species seeking to
take over an ecological niche.

It is this that makes Microsoft's recent MPI announcement so very
interesting.  If history is any predictor of the future, Microsoft's
business strategy is perfectly clear.  They've identified clusters as a
real market with both direct profit potential and with "prestige"
(market clout) associated with it.  Even Apple gets systems into the Top
500, where Microsoft gets laughed at.  SO

  a) co-opt the technology that created the market "Microsoft introduces
its own compatible version of MPI".

  b) sell lots of systems (they hope) into Microsoft-branded clusters.
Invest any and all resources required to help users port their
MPI applications into Microsoft-based compilers, toolsets, and so on.
Develop windows-based cluster management tools so that they can
comfortably manage their clusters from the Windows desktop without
having to really "know" anything.

  c) when they have what they deem to be an adequate market share,
introduce the first INcompatibilities.  "Extend" MPI.  Get users to use
their proprietary or at least nonstandard extensions.  Rely on the
immense cost of de-porting the applications out of their development
environment and use this to leverage the death of their competitors.
Take over the cluster marketplace when people who use alternative MPIs
or cluster OS's are no longer able to run the key applications built to
use WinMPI.

Of course, step c) (which in the past has led to the DEATH of their
competitors) won't work against an open source competitor that doesn't
run on their platform in any significant numbers anyway.  That is, they
stand to gain or lose all ten of the people using an open source MPI on
Windows clusters, or something like that up front.  The one place where
they might be able to make inroads is in clusters that use proprietary
tools to e.g. do genetics or other kinds of work.  Even that will depend
on some armtwisting of those vendors into dropping their support for
linux-based platforms.  I can see no good reason for them ever doing so
as long as linux clusters continue to be strongly represented in the
marketplace, and I see no reason that MOST users will be convinced to
use Microsoft based clusters given that they will be much, much more
expensive and fundamentally less portable.

> When you run on multiple heterogeneous platforms and you are dealing with 
> floating point codes, you need to be very careful with a number of 
> things, including rounding modes, precision, sensitivity of the 
> algorithm to roundoff error accumulation at different rates, the fact 
> that PCs are 80 bit floating point units, and RISC/Cray machines use 
> 32/64 bits and 64/128 for doubles.  It could be done, but if you wanted 
> reliable/reasonable answers, you had to be aware of these issues and 
> make sure your code was designed appropriately.

Right.  Again, in the pre-posix days this was essential.  Even today it
doesn't hurt to be aware.  However, it is now "possible" to write
code that works pretty well across even various hardware architectures.
Usually the issue is one of efficiency, not actual "getting the wrong
answers" unless one does something really stupid and presumes that e.g.
an int is 32 bits no more no less.  If one uses sizeof() and friends or
the various macros/functions intended to yield precision information,
one can write portable code that is quite reliable.
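
A minimal sketch of "ask, don't assume", using Python's ctypes and
sys.float_info as stand-ins for C's sizeof() and the <float.h> macros
(the function name type_report and its dictionary keys are made up for
this illustration):

```python
import ctypes
import sys


def type_report():
    """Return sizes (in bytes) of a few C types plus double-precision limits."""
    return {
        "sizeof_int": ctypes.sizeof(ctypes.c_int),
        "sizeof_long": ctypes.sizeof(ctypes.c_long),
        "sizeof_pointer": ctypes.sizeof(ctypes.c_void_p),
        "sizeof_double": ctypes.sizeof(ctypes.c_double),
        "sizeof_long_double": ctypes.sizeof(ctypes.c_longdouble),
        # Mantissa width and machine epsilon of the platform's C double.
        "double_mantissa_bits": sys.float_info.mant_dig,
        "double_epsilon": sys.float_info.epsilon,
    }


if __name__ == "__main__":
    for name, value in type_report().items():
        print(f"{name}: {value}")
```

The values for long, pointers, and long double are exactly the ones that
differ between platforms, which is why code should query them rather
than hard-wire them.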

I think the more interesting issue IS efficiency.  ATLAS stands as a
shining example of what can be gained by paying careful attention to
cache sizes and speeds and the bottlenecks between the various layers of
the memory hierarchy as one switches block sizes/strides and algorithms
around.  There are really significant differences in BLAS efficiency
between a "generic" blas and one that is atlas-tuned to your particular
hardware.  I think that this is what Mark was referring to -- if
somebody writes their linear application with a stride suitable for a 2
MB cache and runs it on a CPU with a 512K cache it will run, but it
won't run efficiently.  If you do the vice versa, the same is true.  The
efficiency differential CAN be as much as a factor of 3, from what ATLAS
teaches us, so one can WASTE part of the potential of a node if one is
mistuned -- not so much get a wrong answer as waste the resource.  
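
To make the "block size as a tuning knob" idea concrete, here is a
sketch of a blocked (tiled) matrix multiply next to the plain
triple loop.  It is an illustration only, not ATLAS's actual method:
the function names and the default block of 64 are assumptions, and
pure Python will not show the cache effect itself.  The point is the
structure: the block parameter is what one would match to the cache
size of the hardware at hand.

```python
def matmul_naive(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def matmul_blocked(a, b, block=64):
    """Same product, computed tile by tile.

    Each (i0, k0, j0) step touches only block-sized pieces of a, b and c,
    so in a compiled language the working set can be sized to fit cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + block, m)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

Both functions accumulate each output element in the same order of k,
so they return identical results; only the memory access pattern
changes with block.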

To the user this may or may not be a problem.  If I have a problem that
will take me a week to run on a really large grid, that might run in
only three days if I spend three weeks retuning it for all the
architectures represented (presuming I know HOW to retune it) well, the
economics of the alternatives are obvious.  Similarly if I have a job
that will run for a year on the grid if I don't spend three weeks
retuning it per architecture and in six months if I do, well, those
economics are obvious too.  
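
Spelled out as arithmetic (a sketch only; the function name is invented
here, and a year and six months are rounded to 52 and 26 weeks):

```python
def tuning_pays_off(untuned_run_weeks, tuned_run_weeks, tuning_weeks):
    """True if tuning plus the faster run beats just running untuned."""
    return tuning_weeks + tuned_run_weeks < untuned_run_weeks


# One-week job that would drop to three days after three weeks of tuning:
# 3 + 3/7 weeks total versus 1 week, so tuning loses.
short_job = tuning_pays_off(1.0, 3.0 / 7.0, 3.0)

# One-year job that would drop to six months after three weeks of tuning:
# 3 + 26 = 29 weeks total versus 52 weeks, so tuning wins.
long_job = tuning_pays_off(52.0, 26.0, 3.0)

print(short_job, long_job)
```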

And in fact, most grid users will just run their code in blissful
ignorance of the multiarchitecture problem and the fact that their code
might well run 3x faster on the ORIGINAL architecture if it were
rewritten by a competent programmer.  To my direct experience, whole
blocks of physics code are written by graduate students who had maybe two
whole courses in programming in their lives, who wrote in Fortran 77 (or
even Fortran IV), and who thought that the right way to program a
special function was to look up a series in Abramowitz and Stegun and
implement it in a flat loop.  As in sometimes they even get the right
answer (eventually, after a bit of rewriting and testing)...;-)

> [...]
> 
>> Some of the gridware packages do exactly this -- you don't distribute
>> binaries, you distribute tarball (or other) packages and a set of rules
>> to build and THEN run your application.  I don't think that any of these
>> use rpms, although they should -- a well designed src rpm is a nearly
> 
> RPM is not a panacea.  It has lots of problems in general.  The idea is 
> good, just the implementation ranges from moderately ok to absolutely 
> horrendous, depending upon what you want to do with it.  If you view RPM 
> as a fancy container for a package, albeit one that is slightly brain 
> damaged, you are least likely to be bitten by some of its more 
> interesting features.  What features?  Go look at the RedHat kernels 
> circa 2.4 for all the work-arounds they needed to do to deal with its 
> shortcomings.
> 
> I keep hearing how terrible tarballs and zip files are for package 
> distribution.   But you know, unlike RPMs, they work the same, 
> everywhere.  Sure they don't have versioning and file/package registry 
> for easy installation/removal.  That is an easily fixable problem IMO. 
> Sure they don't have scripting of the install.  Again, this is easily 
> fixable (jar files and the par files are examples that come to mind for 
> software distribution from java and perl respectively).

Sure, and when you're done you'll have reinvented an rpm, or close to
it.  

Without doubt, a reinvention can avoid some problems with the original,
whether it is a version bump or a complete new start.  Subversion vs CVS
comes to mind (because I'm trying to make that transition myself at this
moment in time).  It also generally introduces new ones, or reveals
problems that the reinvention could've/should've fixed but didn't.  As
in subversion has some very annoying features as well, features that
actually COMPLICATE maintaining a personal repository of a large base of
packages compared to the much simpler but feature poor CVS.  I'd have
much rather seen an update of CVS to fix its flaws instead of a
completely new paradigm with new flaws of its own that fixes USER LEVEL
features more than anything else.  Introducing e.g. cvs move and cvs
remove seems like it is a lot simpler than creating an entire berkeley
db back end.

In this particular case, if you want to see dark evil, check out pacman
(the ATLAS solution to this problem -- the HEP/DOE ATLAS grid, not the
linear algebra ATLAS):

  http://physics.bu.edu/~youssef/pacman/

Note well the "linux-redhat-7.1" in their examples.  Yes, they mean it.
Telling you why would only end up with all of us having a headache and
being nasty to children and pets for the rest of the day.

That is, most reinventions of rpms are going to not only reinvent
wheels, they will reinvent wheels BADLY (and of course have most of the
problems with permissions and location and so on that rpms will have, as
those problems are fundamental to the nature of the grid and have
nothing to do with the packaging mechanism per se).

To be blunt, a good packaging mechanism will have the following:  a
compressed archive of the sources; a patching mechanism; an automated
build/rebuild process; versioning; dependencies and dependency
resolution; metadata; pre and post install scripts; both install and
deinstall mechanisms.  rpms have all of this but the dependency
resolution part; first yup and then yum added dependency resolution (and
a better handling of metadata).  Other packaging products either will
have these features or they will be distinctly inferior.

Everything else is just how the features are IMPLEMENTED.  Using a
tarball as the toplevel wrapper vs a cpio archive is an irrelevant
change, for example.  rpms could very definitely manage things at the
rpm db level better, and it is still POSSIBLE to build in circular
dependencies and stuff like that, and there are always questions about
how to obsolete things and whether or not it is possible to remove
something and put something in that eventually replaces it without
forcing the removal of its entire dependency tree and ITS reinstallation
as well.  However, most of the design decisions that cause these problems
are conservative and defensive -- if you follow their rules (or yum's
rules) you make it DIFFICULT to break your system where rpm --force or
the unbridled installation of tarballs onto an rpm-based system will
INEVITABLY break your system.
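
The dependency resolution piece, including the circular dependency
trap, can be sketched with a topological sort.  This is an illustration
of the idea only, not how rpm or yum implement it; it assumes Python 3.9
or later for the standard graphlib module, and install_order and the
package names are hypothetical:

```python
from graphlib import CycleError, TopologicalSorter


def install_order(requires):
    """Return an install order in which every package follows its dependencies.

    `requires` maps a package name to the set of packages it depends on.
    Raises ValueError if the dependencies are circular.
    """
    try:
        return list(TopologicalSorter(requires).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


if __name__ == "__main__":
    # Hypothetical package set for a small numerical application.
    packages = {
        "myapp": {"gsl", "pvm"},
        "gsl": {"glibc"},
        "pvm": {"glibc"},
        "glibc": set(),
    }
    print(install_order(packages))
```

A package set in which "a" requires "b" and "b" requires "a" makes
install_order raise instead of silently picking an order, which is the
conservative, defensive behavior argued for above.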

Frankly, having managed systems and wrestled with this for decades now,
I think rpms (augmented by yum) are a little miracle.  Not perfect, but
for a bear they dance damn gracefully.  I am absolutely certain that one
could replace pacman with rpms plus yum on top of any of the
conservative distros (e.g. RHEL and derivatives such as CentOS, SuSE,
probably even FC, CaOSity, Scientific Linux and less conservative
derivatives) and end up with something far, far more robust.  Just
look at the example in e.g.

  http://physics.bu.edu/~youssef/pacman/htmls/Language_overview.html

and see how many places where you see functionality that is redundant
with rpms, only less robust.  Hell, you could make a toolset called
"ms_pacman" (sorry:-) that replaced the entire thing WITH a yum
repository or a gentoo-like rpm build mechanism if only the sandbox
issue was worked out.  And there, there are multiple solutions -- it is
only peripherally an rpm issue.  For yet another "solution", make
/usr/local the binary sandbox for user-based rpms -- all OS rpms comply
with the FHS so /usr/local is completely free.  At the end of a
computation, blank /usr/local.  Strictly control / otherwise -- program
dependencies to be filled only from the approved/controlled grid
distribution repos.  Make sure /usr/local is off of root's path (or use
some other path for the sandbox).

It just isn't that hard, except when you use systems that don't really
comply with rpm's standards.  If somebody wants to give me a grant to
write it, I'd cheerily do so and contribute it to the public weal:-).

> We have seen up to a factor of 2 on chemistry codes.  If your run takes 
> 2 weeks (a number of our customers take longer than 2 weeks), it 
> matters.  If your run takes 2 minutes, it probably doesn't matter unless 
> you need to do 10000 runs.

Exactly.  Might even get a factor of 3.  Sometimes it matters, more
often it doesn't, and usually there are other factors of ~2 to be had
from truly optimizing even on i386...

> It is not hard to manage binaries in general with a little thought and 
> design.  It is not good to purposefully run a system at a lower speed as 
> a high performance computational resource unless the cost/pain of 
> getting the better binaries is too large or simply impossible (e.g. some 
> of the vendor code out there is still compiled against RedHat 7.1 on 
> i386, makes supporting it ... exciting ... and not in a good way)

Ahh, then you'll REALLY like pacman and the ATLAS grid.  Clearly an
"exciting" project -- largely because they've been very, very slow to
recognize that in the long run it costs FAR MORE to run on obsolete
operating systems than it does to bite the bullet and port their
(admittedly vast) code base to something approximating modern compilers
and posix compliance let alone hardware optimization.

So multiarchitecture gridware isn't that hard -- even ad hoc hacked
wheel-reinventing crap can be made to function.  It could be (and
probably is) also done well, of course, if anybody ever bothers.

>> For most of
>> the (embarrassingly parallel) jobs that use a grid in the first place,
>> the point is the massive numbers of CPUs with near perfect scaling, not
>> how much you eke out of each CPU.
> 
> Grids are used not just for embarrassingly parallel jobs.  They are also 
> used to implement large distributed pipeline computing systems (in bio 
> for example).  These systems have throughput rates governed in large 
> part by the performance per node.  Running on a cluster would be ideal 
> in many cases, as you will have that nice high bandwidth network fabric 
> to help move data about (gigabit is good, IB and others are better for 
> this).
> 

Agreed.  However, those grids are a bit different in architecture.
There the grid is a union of clusters (basic definition) but each
cluster is actually architected for the problem at hand.  ATLAS actually
was in this category -- there too data storage issues were as or more
important than raw processing power.  The nature of the HEP world is to
process massive data sets from runs, intermixing (for example) monte
carlo simulations and other data transformations.  Hundreds of terabytes
(you'll recall from our discussions back then;-) was the STARTING point
for a participating cluster center, with a clear scaling pathway to
petabytes as the technology evolved.

So perhaps what I should have said (and did say, elsewhere) is that for
most grids there is an immediate benefit to having more nodes/correctly
architected clusters, even if you don't fully optimize per node.  There
is MORE of a benefit if you DO correctly optimize, where correctly
optimize is at least but not limited to rebuild for the hardware
architecture at hand (remake for x86_64 vs reuse i386 binaries). This is
the kind of optimization that should quite reasonably be handled
automagically by the gridware package.  And where it MAY be much more --
instrument the code to take advantage of cache sizes, relative memory
bandwidths, use specific constellations of supported but nonstandard
hardware instructions.  You've already pointed out the basic economics of
whether or not it is worthwhile to reach for this sort of optimization,
which cannot be done, in general, by a gridware package management
system except where it e.g. decides to install the correct (optimized)
version of libraries such as ATLAS (the linear algebra system).

> Rapidly emerging from the pipeline/grid world for bio computing is 
> something we have been saying all along, that the major pain is (apart 
> from authentication, scheduling, etc) data motion.  There, CPU 
> speed/type doesn't matter all that much.  The problem is trying to force 
> fit a steer through a straw.  There are other problems associated with 
> this as well, but the important aspect of these systems is measured in 
> throughput (which is not number of jobs of embarrassingly parallel work 
> per unit time, but how many threads and how much data you can push 
> through per unit time).  To use the steer and straw analogy,  you can 
> build a huge pipeline by aggregating many straws.  Just don't ask the 
> steer how he likes having parts of him being pushed through many straws. 
>   The pipeline for these folks is the computer (no not the network). 
> Databases factor into this mix.   As do other things.  The computations 
> are rarely floating point intensive.
> 
> Individual computation performance does matter, as pipelines do have 
> transmission rates at least partially impacted by CPU performance.  In 
> some cases, long pipelines with significant computing tasks are CPU 
> bound, and can take days/weeks.  These are cases prime for acceleration 
> by leveraging the better CPU technology.

Yes yes yes.

>>> in that way of thinking, grids make a lot of sense as a 
>>> shrink-wrap-app farm.
>> 
>> Sure.  Or farms for any application where building a binary for the 2-3
>> distinct architectures takes five minutes per and you plan to run them
>> for months on hundreds of CPUs.  Retuning and optimizing per
>> architecture being strictly optional -- do it if the return for doing so
>> outweighs the cost.  Or if you have slave -- I mean "graduate student"
>> -- labor with nothing better to do:-)
> 
> Heh... I remember doing empirical fits to energy levels and band 
> structures and other bits of computing as an integral part of the 
> computing path for my first serious computing assignment in grad school. 
>   I seem to remember thinking it could be automated, and starting to 
> work on the Fortran code to do.  Perl was quite new then, not quite to 
> version 3.
> 
> Pipelines are set up and torn down with abandon.  They are virtualized, 
> so you never know which bit of processing you are going to do next, or 
> where your data will come from, or where it is going to until you get 
> your marching orders.  It is quite different from Monte Carlo.  It is 
> not embarrassingly parallel per node, but per pipe which may use one 
> through hundreds (thousands) of nodes.
> 
> Most parallelization on clusters is the wide type:  you distribute your 
> run over as many nodes as practical for good performance. 
> Parallelization on grids can either be trivial ala Monte Carlo, or 
> pipeline based.  Pipeline based parallelism is getting as much work done 
> by creating the longest pipeline path practical keeping as much work 
> done per unit time as possible (and keeping the communication costs 
> down).  Call this deep type parallelism.  On some tasks, pipelines are 
> very good for getting lots of work done.  For other tasks they are not 
> so good.   There is an analogy with current CPU pipelines if you wish to 
> make it.

Very interesting.  So perhaps for certain CLASSES of task one can create
automagical optimizers either for the build process or the execution
process, provided that you have tools that can directly measure or
extract critical information about bandwidths, latencies, bottlenecks
along the pipeline pathway both within nodes and between nodes.

Which I completely believe of course, and which is the basis for the
xmlbenchd project that I've started and had NO time for for five months
or so.  I did get the core benchmark code (ex cpu_rate) to spit out
xml-wrapped results in time to show Jack Dongarra when he visited, and
there are a few other humans in the world working on this at the
computer science level.  It does seem to be an essential component of
building "portably efficient" grid applications if not general cluster
applications, where one CAN optimize on YOUR cluster but might prefer to
use generic and portable tools to do so.

    rgb

> 
> Joe
> 
>> 
>>    rgb
>> 
>>>
>>> regards, mark hahn.
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615