Cluster Monitoring software?

Robert G. Brown rgb at phy.duke.edu
Tue Nov 7 07:58:18 PST 2000


On Mon, 6 Nov 2000, Dan Yocum wrote:

> Robert,
> 
> I'm easily confused: in the chart, is the Alpha on the top or the bottom
> of the graph?  I'm assuming that it's on the top, but then I see that
> the Thunderbird falls below the Athlon (I'll admit that I don't know the
> difference between the 2), which throws me.
> 
> Thanks for the clarification.

[OK, this is a long one.  I should just give up and write a journal.
Folks with real work to do should just hit "d" or put it off to later.]

Curiously and interestingly, the Alpha >>is<< on the bottom of the chart
in L1 and much of L2, slower than the lowly 466 MHz Celeron in L1.
However, it surpasses and blows away all of the CPUs once the working
set spills out of cache and it is running out of main memory, as it has
a much bigger L2 and much faster main memory.  Some of
the alpha guys at ALSC were concerned that there is "something wrong
with this picture", but I assure you it is a good faith result -- I have
nothing against the alphas.  I've also just now tested it again and it's
still slower.  It could be just me or an error in the way I'm doing
things, of course, but I welcome outside validation and the cpu-rate
source is right there on brahma for anyone to grab or critique.

I will say that I have corroborative evidence for this being a correct
(or not radically incorrect) ranking in the form of time to completion
of my Monte Carlo code.  This is code for which "cache works" to a fair
degree -- it is typically not memory I/O bound.  Here is a small subset
of my results from benchmarking this on various platforms over nearly a
decade:

#============================================================
# Benchmark run of On_spin3d on host lucifer
# CPU = Celeron 466, Total RAM = 128 MB PC66 SDRAM
# L = 16
# Time = 46.48user 0.03system 0:46.67elapsed 
#============================================================
# Benchmark run of On_spin3d on host brahma
# CPU = 400 MHz PII, Total RAM = 512 MB PC100 SDRAM
# L = 16
# Time = 53.01user 0.03system 0:54.55elapsed
#============================================================
# Benchmark run of On_spin3d on host qcd2
# CPU = 667 MHz Alpha EV67, Total RAM = 512 MB
# Compiler = gcc
# L = 16
# Time = 66.08user 6.14system 1:12.25elapsed 
# Compiler = ccc
# Time = 23.79user 0.00system 0:23.84elapsed 

Note the near perfect clock speed scaling for Intel (PII and Celeron),
which have PC100 and PC66 SDRAM, respectively.  This suggests that the
code isn't I/O bound and that even a small cache works "well enough" for
this program.  Running it on the XP1000 qcd2 (yes folks, "quantum
chromodynamics 2":-) we see BOTH that gcc does very poorly AND that the
Compaq compiler ccc does nearly 3x better, but nowhere near as well as
one might expect based on the "Alphas are 3x faster than Intels" rule
I've heard propounded.  In fact, multiplying it out we find that >>for
this code<< alphas are only a bit less than 1.4x faster than Intels,
clock for clock, if one uses ccc on the alphas and gcc on the Intels
(both at just -O3).  This
allows me to predict that a 933 MHz PIII should just about precisely
match the alpha in time to completion.  By great good fortune, I happen
to be sitting at a brand new 933 MHz PIII this morning (the newest
incarnation of "ganesh") and (wait for it...)

#============================================================
# Benchmark run of On_spin3d on host ganesh
# CPU = 933 MHz PIII, Total RAM = 512 MB PC133 SDRAM
# Time = 23.53user 0.02system 0:23.75elapsed

which seems near enough.  Nice, predictable performance, gotta love it.
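
Spelling out the clock-scaling arithmetic behind that prediction (my
shorthand notation; all of the numbers come from the benchmark output
above):

\[
  \frac{t_{\rm PII/400}}{t_{\rm Alpha/667}} = \frac{53.01}{23.79}
  \approx 2.23,
  \qquad
  \frac{2.23}{667/400} \approx 1.34 \quad\mbox{(clock for clock)},
\]
\[
  t_{\rm PIII/933} \approx 53.01 \times \frac{400}{933} \approx 22.7
  \ \mbox{seconds},
\]

which is within a few percent of the 23.5 seconds measured on ganesh.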

Even this is just a "snapshot" type result, though.  If I vary (for
example) the lattice size I can shift the comparative speeds by a fairly
large factor, and not always in predictable directions.  Perhaps there
is indeed a range where the alphas are 3x faster.  There are certainly
ranges where they aren't.  And then there is the cost differential to
consider.

Another interesting result is that on my MC code, using the Compaq
reference compiler ccc is absolutely necessary.  Without it, a
667 MHz 21264 is SLOWER on my program than a 400 MHz PIII.  The OnSpin3d
code itself is a very mixed lot of floating point, transcendental calls,
and integer arithmetic (mostly for managing loops and matrix
displacements).  I suspect that it is the transcendental calls (I use
exp, ln, sqrt, sin, (a)cos, and (a)tan in various places) that are really
screwing gcc here, but I haven't microscopically investigated.  It could
also be something silly like moving the RNG in and out of cache -- I use
a LOT of uniform deviates.  That is, I might be able to go in and hand
tune and get rid of most of the factor of three, or I might not.
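
A crude way to test the transcendental hypothesis would be to time the
library calls in isolation under each compiler.  Something like the
following sketch would do (the loop bound and the exp/sin mix are
arbitrary choices of mine, and this is NOT part of cpu-rate or
OnSpin3d):

/* transrate.c -- rough timing of transcendental library calls.
 * A sketch only: build it at plain -O3 with gcc and with ccc and
 * compare the reported rates.
 */
#include <stdio.h>
#include <math.h>
#include <time.h>

int main(void)
{
    const long n = 10000000L;      /* number of passes through the loop */
    double x, sum = 0.0;
    clock_t start, stop;
    long i;

    start = clock();
    for (i = 0; i < n; i++) {
        x = (double) i / (double) n;
        sum += exp(-x) + sin(x);   /* representative transcendental mix */
    }
    stop = clock();

    /* print sum so the compiler cannot throw the loop away */
    printf("sum = %f, %.2f million transcendental calls/sec\n", sum,
           2.0 * n / 1.0e6 / ((double)(stop - start) / CLOCKS_PER_SEC));
    return 0;
}

If the calls/sec differ by anything like the factor observed on
OnSpin3d, the finger points at the math library rather than at gcc's
code generation proper.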

On the other hand, on the cpu-rate code, gcc is FASTER than ccc.  I
know, it can't be or shouldn't be or whatever you like.  But when I run
this program on my machines today (just now) it is, at least with -O
through -O3: about 15% faster in upper L1, dropping to 8-10% faster
in mid-L2 and beyond.  Of course, I do no hand tweaking of the compiler
flags for the benchmark -- I just use straight -O3 for all runs or I'd be
comparing apples to oranges and my results wouldn't be extrapolable for
a typical physicist (who knows just about enough to set -O3 in the
Makefile for his code, but who often has neither time nor inclination to
go in and figure out if any of the esoteric compiler tuning flags
improve performance by N%).

This adds a bit of credence to the idea that gcc's alpha port "failures"
are in either the support libraries for e.g. transcendentals or in the
overall optimization of code structure.  This simple (cpu-rate) program
gets little benefit from structural reordering or e.g. function
inlining -- its structure is about as simple as it gets already -- and
it has no transcendental calls.  On it, gcc does excellently and turns
out to be the best compiler.

OnSpin3d, on the other hand, is a program written strictly to cleanly
translate and study some physics equations -- it is organized to make it
easy to understand, easy to maintain, and easy to alter and explore the
physics, not to optimize its numerical execution.  (In the tradeoff
between easy-to-maintain-understand-and-port and raw speed on a given
architecture, the former wins in my book: it is MY time that is spent on
maintenance, but I can always find other things to do while my systems
trundle along doing the calculations.)  The alpha compiler is likely
"smart" enough to
restructure the actual code so that it runs efficiently in spite of my
best efforts to make it understandable to humans.  gcc, apparently, does
not.

Anyway, I hope that all of this isn't boring list members.  This isn't
really a "benchmark" list (and there are benchmark lists, e.g.
lmbench at bitmovers.com) but we do spend a good deal of time talking
about numerical performance and price trade-offs.  Performance certainly
matters in beowulf design; it is the raison d'etre and all that, if one
thinks about it.

I think that it is very important that folks considering purchasing a
beowulfish cluster, or engineering one out of components at hand,
understand that finding the optimum beowulf design for a given budget is NOT a
"two-dimensional" problem.  Specifically, beowulf performance is by no
means determined by something silly like

  (MFLOPS/node)*(# of nodes) = Total MFLOPS

which one then maximizes for a given dollar investment.

As the graphs in the aforementioned little paperlet (apologies to the
inventors of the "applet") show, even the simplest possible definition of
MFLOPS itself varies by almost an order of magnitude as a function of
code size and access pattern.  As the MC benchmark shows, for some
applications rank order of CPU speed is determined more by MFLOPS
performance in L1 than by MFLOPS out of main memory (not really all that
surprising -- that is what cache is FOR).  For others I expect it is the
other way around.

It also shows that even (bogo)MFLOPS as a function of vector size,
evaluated at any particular presumed vector size, isn't a good predictor
of relative or
absolute performance on complex code.  We'd need to separately measure
rates for e.g. integer arithmetic, loop management, transcendental calls
and factor them in as well.  Finally, there can be major differences
between compilers that are not themselves predictable.  I really didn't
expect gcc to be faster than ccc on the cpu-rate program, as I'd found
it SO much slower on OnSpin3d at EQUIVALENT sizes for the overall
program data.
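
For anyone who wants a feel for what cpu-rate measures without grabbing
the source off brahma, the kind of measurement involved is nothing more
exotic than timing a simple vector operation over a range of vector
lengths.  A stripped-down sketch (emphatically NOT the real cpu-rate
code -- the operation, the lengths, and the timing method here are
arbitrary):

/* vecrate.c -- (bogo)MFLOPS of a simple vector operation as a
 * function of vector length.  An illustration only, not the real
 * cpu-rate benchmark.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    long len, i, rep, reps;
    double *a, *b, secs;
    clock_t start, stop;

    for (len = 1024; len <= (1L << 22); len *= 2) {
        a = malloc(len * sizeof(double));
        b = malloc(len * sizeof(double));
        if (!a || !b) { fprintf(stderr, "malloc failed\n"); return 1; }
        for (i = 0; i < len; i++) { a[i] = 1.0; b[i] = 2.0; }

        reps = (1L << 26) / len;           /* keep total flops roughly fixed */
        start = clock();
        for (rep = 0; rep < reps; rep++)
            for (i = 0; i < len; i++)
                a[i] = a[i] + 1.5 * b[i];  /* 2 flops per element */
        stop = clock();

        secs = (double)(stop - start) / CLOCKS_PER_SEC;
        printf("len %8ld  %8.2f MFLOPS  (a[0] = %g)\n", len,
               2.0 * reps * (double)len / secs / 1.0e6, a[0]);
        free(a);
        free(b);
    }
    return 0;
}

Plot the reported MFLOPS against len and the kind of L1/L2/main-memory
structure shown in the paperlet's graphs falls right out.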

And we haven't even gotten to the network yet, or factored in the cost
differential of the potential nodes.  Beowulf design likely requires
understanding of system performance and COTS economics in at least seven
or eight dimensions, more likely ten or twelve.  One of my many personal
projects is to build a set of tables and graphs of relative system
performance (and build or determine the best tools to measure them) and
post them up on the brahma site so that folks can study the performance
profiles in at least some (single or pairs of) dimensions.  

I'd also REALLY like to see something like the A(utomatic) T(uning)
concept introduced in ATLAS generalized and turned into a feature of
linux.  Linux is nearly unique in that it (along with e.g. FreeBSD) runs
on a wide range of hardware architectures, making code movement easy
between different platforms (something that used to be a major PITA).
However, making code PREDICTABLY EFFICIENTLY movable between platforms
is still not easy, as these simple tests show.

I see no way to look at any set of published benchmarks for the alpha
21264 vs P6 family and predict performance of OnSpin3d for all possible
lattice sizes (it does simulation on a 3d lattice).  Also, even if I'm
willing to rewrite it to run more efficiently, I'm NOT willing to
rewrite it to run more efficiently on the PIII and have it behave like a
pig on the Alpha, or have to rewrite it/retune it when Intel's or AMD's
next generation CPU finally comes out.  I'd like to rewrite it to
autotune.  But how?  I think this problem is solvable (and further that
it is the SAME problem that one faces writing parallel code that runs
efficiently on a heterogeneous-node cluster architecture).

If we can come up with a spanning set of detailed system
"microbenchmark" performance measurements that are generated by a
standard program/daemon or kernel module and available via e.g. a /proc
interface or library call, one >>might<< be able to write code that
reads in these numbers and adjusts stride, matrix granularity, and
parallel division accordingly -- to "AT" general purpose code.  lmbench
contains routines that can measure a lot of the requisite numbers, but
is still too "spotty" in its general application -- as the cpu-rate
figure shows, one needs to measure most rates as FUNCTIONS of e.g.
vector length to even BEGIN to get an understanding of them.  The ALSC
paper shows that this is equally true of net-based IPC rates.

I'm trying to write all this into the draft beowulf book as well.

To change subjects in the middle of the stream (of consciousness:-)...

...and speaking of the draft beowulf book (now available as an
HTML-browsable document on brahma), Doug Eadline (who is planning to
join me on the project) and I have plans for making this a "living book"
in the sense that it is constantly maintained and updated.  The biggest
problem with writing a beowulf book (really any kind of systems book) is
that Moore's Law and RGB's Law of Open Source Development (see below:-)
conspire to make it largely obsolete in the 6-9 months between when the
book is "finished" and when one can finally get it out the door in
paper form.  One solution to this is to simply revise the book roughly
yearly and solicit contributions to the book from project developers
that detail their own beowulf-relevant projects.

Both Doug and myself feel rather firmly that the book MUST remain freely
available online in its "morphing" form, especially if it does end up
being collectively written in the sense that it contains chapters or
articles written and contributed by various individuals and groups,
although we're also dickering with publishers interested in publishing a
paper copy of it.  I personally would welcome an income stream from the
project since I'm spending a LOT of time on it, and I'd guess that any
contributors would get a proportional slice of any royalty pie if they
wished as well.

Anyone wishing to contribute a latex chapter:

\chapter{BeoDuctTape}

This is all about BeoDuctTape, a readily available tool that can
bind your pocket calculators into an advanced parallel computer...

on e.g. pvfs, bproc/scyld beowulf, MPI, PVM, CFD, XFS, ATLAS, or
whatever should contact me.  If there is enough interest (which would
likely be "any interest at all", since anything >>YOU<< write >>I<<
don't have to:-) I'll put an "author's kit" up on the brahma website or
create a website with a CVS repository for the project alone.  

Turnkey (or other) beowulf vendors are also welcome to send me SHORT
latex sections:

\section{BeoBeer}

Research has shown that beowulf builders consume large amounts of beer
per annum, and that they have highly refined taste when it comes to the
rich, amber liquid that flows down their gullets while pecking away at
keyboards in the dark of the night.  BeoBeer International is a company
that was founded to provide easy access to the most exotic and
interesting brews and microbrews of the planet.  Visit our website,
www.beobeer.com, and arrange to have Red Stripe, Golden Eagle, or
Bohemia delivered straight to your system room door in recycled Dell and
IBM boxes labelled "Computer System", "Fragile".  Open the box, and find
carefully chilled bottles of golden delight...


At the moment all figures are eps included via psfig, but I'm likely
going to have to work out a more general methodology for this...

   rgb

P.S. -- RGB's Law of Open Source Development simply recognizes that OS
development is a genetic optimization process that also shares some
features with critical processes such as crystal growth.  Read "The
Lucifer Principle" to get an idea of how GA's permeate all aspects of
human endeavor -- that OSD is one is no great surprise.  Thus:

Consider a "participating population" of size P.  There is a constant
nucleation of new projects at a rate proportional to the participating
population.  There is an abandonment of old tools that occurs at a much
slower rate nearly independent of P.

Projects at any given time have P_proj participants and attract new
participants proportional to (P-P_proj) (which obviously saturates when
everybody in the population is participating).  The >>rate<< for each
project, however, is determined by its general utility to individuals in
the population.  Finally, projects advance (their general utility) via a
genetic algorithm, which means that improvement is roughly proportional
to P_proj^3.  This same genetic algorithm means that unsuccessful
projects are constantly being "pruned" when their general utility falls
to zero (or becomes significantly smaller than the utility provided by a
competing project).
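
In symbols (my own hasty notation -- the k's are rate constants, and
I've left out the extra saturation terms I mention below), the model
might be written:

\begin{eqnarray*}
\frac{dN_{\rm proj}}{dt} & = & k_n P - k_a \\
\frac{dP_{\rm proj}}{dt} & = & k_p \, U_{\rm proj} \, (P - P_{\rm proj}) \\
\frac{dU_{\rm proj}}{dt} & = & k_u \, P_{\rm proj}^3
\end{eqnarray*}

with a project "pruned" when U_proj (its general utility) falls to zero
or well below that of a competing project.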

Expressed as differential equations and with some reasonable values for
the parameters, we should find that OSD generates LOTS of projects in
any given time period.  We should find that LOTS of them die out almost
immediately (or get to a certain stage and become frozen and eventually
disappear when not even the original authors still use the tools).  We
should find that SOME of them, often in families, grow exponentially
rapidly in P_proj and saturate (yes, I need more saturation parameters
in the rate description) and then advance extremely rapidly (because of
that P_proj^3).

BTW, this little differential system of equations, if fully developed
and made quantitative by measuring e.g. project appearance and
participation on freshmeat, sourceforge, bitkeeper, Gnome, or other OSD
sites, could be used to QUANTITATIVELY PREDICT approximately when
linux/Gnu in general will pass the various closed source software
pools.  In general, closed source development suppresses the P_proj^3
and links participation rates directly to utility and saturates not on
the basis of population per se, but on a mix of population and cost.  A
fun little project for either economics students or computer science
students or a collaboration of both...;-)

The latter (rapidly growing, very useful, rapidly advancing) projects
are the real "problems" in writing a book on beowulfery.  We can
"predict" empirically on the basis of experience that some 3-5
"significant" projects will appear in a year that "should" be in a book
on beowulfery, and that this rate itself may well grow as P_beowulf
continues to increase.  

In just six months, the Scyld project has already made part of my first
partial draft obsolete and forced a rewrite of several chapters.  My own
project on benchmarking and systems measurement, which was BEGUN because
of the book project (when I realized that my descriptions of system
performance were too qualitative and that no good tools existed as a
collection to make them quantitative), has also forced me to more or
less dump and rewrite whole chapters.  PVFS has advanced. MPI has
improved and continues to fix minor bugs.  Software advances actually
>>surpass<< Moore's Law when a project nucleates in a region of
unbounded growth (the "killer app" phenomenon, e.g. Lotus 123, the Web,
beowulfery itself).  In a year (an optimistic timeframe to completion of
a full draft) it will already be time to rewrite another couple of
chapters to incorporate new projects or alter critiques of particular
tools that in the meantime "fixed" the problems that were discussed.

Enough.

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu