[Beowulf] What services do you run on your cluster nodes?

Wed Sep 24 10:35:10 PDT 2008

On Tue, 23 Sep 2008, Donald Becker wrote:

>> XML is (IMO) good, not bad.
>
> I have so much to write on this topic, I'll take the first pot shot at RGB
> ;-)
>
> XML is evil.  Well, evil for this.

Oh, it's fine.  I've gone the rounds on this one with Linus Himself (who
agrees with you, BTW:-).

However, I present case study of the two approaches below that shows
that a) this is really a question of tradeoffs, cost-benefit, with
advantages and disadvantages both ways but DIFFERENT advantages both
ways; b) the costs of using XML are less than you are implying, for at
least the kind of cluster and usage for which xmlsysd was designed.

Either way then, one has to consider whether benefits are worth the
costs -- to you, for your cluster and your task mix -- as usual.

> Ganglia does consume a significant portion of resources.  I've heard
> first-hand reports of 20% CPU.  (Admittedly before they figured out what
> was happening and turned the reporting list and frequency way down.)
>
> When we found that we needed to support Ganglia, I grumbled mightily and
> set out to figure out why it was so bad.  Deep in its black heart I found
> the source of evil: XML.  And not just the detailed evil of the syntax,
> the evil power had seeped into the higher level architecture with
> needless flexibility.[*1]

I'd argue -- persuasively I think -- that the evil was not in XML per se
but in various other aspects of ganglia such as its flexibility and
ability to absorb and report back "any" statistic.  The CPU overhead on
a node required to generate a single packet or short message spanning a
few packets of ANY sort -- including compressed and encrypted -- at a
time granularity of once per second is a negligible fraction of a
percent.

Now, Linus did argue equally (or even more:-) persuasively that the
overhead associated with converting /proc itself to use XML per se was
too great and sure, he's probably correct.

Still, there would be an ADVANTAGE to doing so, even if one did it as an
exercise and then simply threw it away:  constructing an XML view of
/proc would force one to directly confront all of the deep, abiding evil
that resides therein from NOT enforcing any sort of XML-like
hierarchical view of its contents.  Anybody who has had to write code
that parses /proc has to deal with what, three or four distinct
constructs for associating a variable of interest with a value, with an
even greater variety of ways of expressing the hierarchy in which that
value lies?  /proc contains everything from file containing value name
(separator) value to name (separator) value_1 (separator) value_2 ...
value_1024 (often in the same file, e.g.  /proc/stat) to tables
(/proc/net/dev) to pathlike hierarchy /proc/path/value_name files with
contents value.

Where the latter, I would note, is more or less identical to at least
one of the canonical views of XML and is in fact how I lay out and parse
the XML returned by xmlsysd.  Values laid out in this way constitute
most of the newest ones provided by the kernel -- the problem is that
there is a LEGACY of code that provided values laid out however the
inventor of that particular chunk of code wanted to lay it out.  One day
(maybe) we can hope that an e.g.  3.x kernel will come along and enforce
a strict standard hierarchical view of /proc that "is" XML even if it
uses nominal filesystem paths to precisely correspond to XML
hierarchical nesting, but because of inefficiencies in using the
filesystem in this way INSTEAD of single files with XML, or because of
the demands for efficiency when parsing minimal but nearly
human-unreadable e.g. /proc/PID/stat perhaps not.

So I >>agree<< with your point 1, but it isn't the correct point.  /proc
was written without XML but ended up an ill-disciplined mish-mosh of
(possibly) locally optimized parsable objects as a consequence not of
use or non-use of XML, but because of a lack of discipline in the design
and layout or clear correspondance with data objects.  Point 1 basically
says that using XML doesn't FORCE you to use discipline and consequently
one can implement it badly and nothing could be more true, but it isn't
for lack of trying in the XML standard.  XML is flexible and extensible
for a reason, but that doesn't mean that it should be taken as an excuse
not to -- eventually -- develop a sound set of data objects.  If
anything, it should be a means to help one figure out what that sound
set of objects really IS, because the process of encapsulation most
naturally reflects the data structures into which they are placed in
memory.

Its biggest problem isn't discipline, it is size.  The hierarchical
discipline it attempts to force on users is, if anything, good for them
in most cases.  It forces them to confront the question of just what the
right data objects are and how they relate to other objects and are most
naturally arranged.  The data view thus produced is self-documenting to
the extent sensible tag names are chosen, manifestly hierarchical (it
may not be the RIGHT hierarchy, but it is a hierarchy just the same),
and consequently quite human readable.  It can be parsed trivially with
readily available tools instead of black magic (where /proc is much more
in the latter category).  To the extent that one has built a GOOD data
hierarchy, it is quite extensible -- it is easy to poke new objects into
the data hierarchy without breaking the path to old objects, so
backwards and even forwards compatibility isn't out of the question.  To
the extent that the data objects thus encapsulated are well defined, one
can even build tools that manipulate the data automagically in many ways
while pulling e.g. attributes to control the process from the XML itself
-- XML is in principle rather portable.  It is, in fact, rather
BEAUTIFUL.

The beauty has a price, and the price is size (and to a lesser extent,
the additional time required to generate the beautiful human readable
view).  The cheap way to send a data structure across the network is to
take the raw data structure, treat it as an anonymous block of memory,
and send it, placing the block on the other end into an identical data
structure via the magic of pointers and thereby dereferencing its
contents.  The information being sent will in such a way will often be
almost incompressible so that you CAN'T make it any smaller.  It is
efficiency itself -- the smallest packet one can produce with the least
possible amount of handling and processing.  It is also human
unreadable, tremendously fragile (if e.g.  one end is bigendian and the
other littlendian or data objects have different sizes in the struct etc
it will fail silently and often tragically), nearly impossible to debug,
and STILL subject to good or bad data objects in the layout of the
struct itself.

So the tradeoff is really a familiar one.  Code/Data efficiency vs
Code/Data readability and robustness.  These tradeoffs make everybody
unhappy but lets them CHOOSE their personal degree of unhappiness.  They
can send raw binary and be happily efficient and sad when it breaks or
changes or they are developing the application.  They can send XML and
be happily robust, happily readable by humans, happily debuggable and
extensible and portable, but sad when they see how large the message is
that they send compared to an information-theoretically minimal message
with the same content, sad when they note that they have to do a bit of
work to prepare the message in the first place for sending.

What is the RIGHT tradeoff?  Depends on the problem, of course.  Human
readability and maintainability and extensibility is worth a lot, but
some problems will tolerate the loss of efficiency it costs to achieve
them and some won't.  Some problems will tolerate it at SOME SCALES but
not at EVERY scale.

Usually, the best possible tradeoff is one that doesn't, one that yields
the best of both worlds.  For example, XML would be just great if it
were possible to construct an a priori tag dictionary, create a map from
tags to information-theoretically minimal binary (where every tag in
most problems could be reduced to one or at most two bytes in length),
and install the DICTIONARY on both sender and receiver ends.  This would
effectively compress the actual XML to where the overhead associated
with sending the message is down to perhaps 10-40% of the raw binary
message, not an unreasonable price to pay for trivial library-based
reversal at the other end into human readable, debuggable, etc XML, and
it could accomplish it with minimal CPU -- less than what would be
required to actually compress each message and send along a per-message
dictionary and probably less than what is required to construct the
uncompressed XML message.  But this is just one of many binary XML
schemes and projects underway, and if/when any are adopted as real
standards and widely supported by library calls people will have to
continue to make a moderately unsatisfactory choice or roll their own
(e.g. (de)compressing with gzip on both ends).

In the particular case of xmlsysd, it is roughly the fourth complete
rewrite (the first two were called procstatd and were ugly and
undisciplined, the first xmlsysd was much prettier and disciplined but
didn't get the hierarchies quite right, and the current version of
xmlsysd is -- I think -- very nicely hierarchically organized.  It
achieves adequate efficiency for many uses a different way -- it is only
called on (client side) demand, so the network isn't cluttered with
unwanted or unneeded casts (uni, multi, broad).  It is "throttleable" --
the client controls the daemon on each host, and can basically tell it
to send only what the client wants to monitor (at some granularity in
the hierarchy -- not specifying individual fields but the contents of
e.g. specific proc-derived files).  So if you aren't monitoring memory,
you don't read or parse or send the contents of /proc/meminfo.  If you
are only interested in (say) 1, 5, 15 load average, you throttle so that
only that is sent:

init
off all
on loadavg
send
Content-Length: 217

<?xml version="1.0"?>
<xmlsysd init="1">
   <proc>
     <loadavg tv_sec="1222275182" tv_usec="443155">
       <load1>0.05</load1>
       <load5>0.12</load5>
       <load15>0.05</load15>
     </loadavg>
   </proc>
</xmlsysd>

This is down to a few hundred bytes.  Sure, it could be sent in 20 bytes
as four uints and a luint, but then it would look like:

%!1 )

or some such, and nobody could read it without a binary tool, and one
would have to parse the receiving socket by byte count since EFFICIENCY
doesn't even permit a line terminator.

So the cost in message length is profound -- a factor of 10 in this
case.  The cost in human readability and portability is equally profound
-- one is, one isn't, period.  But what is the cost in what REALLY
counts: time?

Well, both of them have to be sent by the network.  One can choose UDP
or TCP for either one, and each has advantages and disadvantages (with
-- gulp -- the same tradeoffs between reliability and robustness and
speed).  Fine, pick (say) TCP.  In that case one pays a TCP latency hit
to send EITHER message, and then a bandwidth hit roughly proportional to
the total message size INCLUDING the requisite header.  Including the
header the longer message is only roughly four times the size of the
latter, not ten.  The TCP latency hit is also (usually) much larger than
the bandwidth hit for minimum size messages -- so much so that it might
well take less than twice as long to send the longer message than the
shortest possible one.  And that cost is the cost of A SINGLE NETWORK
MESSAGE -- worst case a few hundred microseconds, even if the message
spans several packets.

Sending a message every second might therefore be expected to "waste" as
much as 1/1000th of the compute time available in that second -- 0.1% --
compared to the unreadable but shortest possible message, which itself
requires just under 0.1%.  In practice (using xmlsysd to monitor system
load while it itself runs) this is a reasonable estimate that is if
anything conservative.  UDP has a similar analysis, but scaled down by a
factor of two or so -- it doesn't really change the general conclusions,
which are that the impact ON THE NODES of running xmlsysd -- especially
on its recommended/wulfstat-default 5 second sampling interval -- is
pretty much negligible, XML and all, for non-synchronous nodes.  For
synchronous nodes, serially polling the nodes (as wulfstat does) leads
to serial interruption of node progress and the efficiency hit is
nonlinearly amplified.

One can get around this by sampling less often -- even out at 1000 nodes
it should update in around a second, so polling them only once every 120
seconds gives you node status on all nodes within the last two minutes
at less than 1% cost in efficiency, or of course scale up to even
greater efficiency even less often.  But truthfully, xmlsysd wasn't
designed for usage on a large cluster running synchronous code.

> Those familiar with BeoStat might point out that it does not have the same
> functionality.  It doesn't report everything that Ganglia can.  That's
> exactly the point of BeoStat -- it reports only the essential state,
> status and statistics for nodes.  And it reports the same items, every time
> and in every installation.  It's much more easily used by tools because
> they can count on which values will be reported, and the update frequency.
> Static values are gathered once a "node_up" time, and changing values
> are updated once per second while the node is up.

All good and very similar to what xmlsysd does as well, except that
xmlsysd only updates values when requested by the master node/client,
and imposes no load at all until it is needed.  But you use beostat as
part of a tightly integrated set of tools that function as a cluster
"kernel" -- I was just writing a cluster monitoring tool that doesn't
generally need to collect information on a time granularity smaller than
once every 5 seconds or so and that might as well turn off entirely when
I'm not watching.

> BeoStat is used by multiple schedulers, mappers, stat displays (e.g.
> Ganglia) and GUIs.  It's an essential part of the Scyld architecture and
> reflects it design philosophy -- minimizing operational overhead on the
> compute nodes, while simplifying and unifying the cluster.  We use a
> single reporting daemon per compute node, and the results are gathered
> together and published at one place -- master nodes.

Actually, from here on down -- with the exceptions of choosing to
use xml to encapsulate the return, TCP instead of UDP, and allowing the
client side to control and throttle the daemon so that one can tune the
impact of the monitoring to the demands of the cluster and task, the two
things sound very similar -- as they should be, given that they're both
N>3 generation tools.  I expect yours works better on really large
clusters and more efficiently on all clusters, at the expense of not
being terribly easy to get at without the tools of your particular
toolset.  xmlsysd, OTOH, is like sendmail.  You can telnet to it and
talk to it with a simple command set and read (or parse) what it sends
back with tools either provided or that you construct yourself.  And it
requires nothing special -- drop it in place, use it.  Works on LANs and
personal systems as well as cluster nodes.

    rgb

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977