[Beowulf] Performance metrics & reporting
Donald Becker
becker at scyld.com
Tue Apr 8 14:32:43 PDT 2008
On Tue, 8 Apr 2008, Jesse Becker wrote:
> Gerry Creager wrote:
> > Yeah, we're using Ganglia. It's a good start, but not complete...
>
> The next version of Ganglia (3.1.x) is being written to be much easier to
> customize, both on the backend, by allowing custom metric-collection modules
> for gmond, and on the frontend, with changes that make custom reports easier
> to write. I've written a small pair of routines to monitor SGE jobs, for
> example, and it could easily be extended to watch multiple queues.
It might be useful to consider what we did in the Scyld cluster system.
We found that a significant number of customers (and potential customers)
were using Ganglia, or were planning to use it. But those using it
intensively complained about its resource usage. In some cases it was
consuming 20% of the CPU time.
We have a design philosophy of running nothing on the compute nodes except
for the application. A pure philosophy doesn't always fit with a working
system, so from the beginning we built in a system called BeoStat (Beowulf
State, Status and Statistics). To keep the "pure" appearance of our
system we initially hid this in BeoBoot, so that it started immediately at
boot time, underneath the rest of the system.
How are these two related? To implement Ganglia we simply chopped out the
underlying layers (which spend a huge amount of time generating and then
parsing XML) and generated the final XML directly from the BeoStat
statistics already combined on the master.
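In sketch form, that translation layer is little more than a walk over the
master's in-memory stat table, printing Ganglia-dialect XML as it goes. A
hypothetical fragment (the struct fields and names are invented for
illustration, and the Ganglia attributes are abbreviated):

    #include <stdio.h>
    #include <time.h>

    /* Invented per-node record -- stands in for the statistics that
     * BeoStat has already combined on the master. */
    struct node_stats {
        char name[64];
        time_t last_report;
        unsigned ncpus;
        float load_one;
    };

    /* Emit one <HOST> element in (abbreviated) Ganglia XML. */
    static void emit_host(FILE *out, const struct node_stats *ns)
    {
        fprintf(out, "<HOST NAME=\"%s\" REPORTED=\"%ld\">\n",
                ns->name, (long)ns->last_report);
        fprintf(out, " <METRIC NAME=\"cpu_num\" VAL=\"%u\" TYPE=\"uint16\"/>\n",
                ns->ncpus);
        fprintf(out, " <METRIC NAME=\"load_one\" VAL=\"%.2f\" TYPE=\"float\"/>\n",
                ns->load_one);
        fprintf(out, "</HOST>\n");
    }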
This gave us the best of both worlds. From BeoStat: no additional load on
the compute nodes, lower network load, much higher efficiency, and easy
scalability to thousands of nodes. From Ganglia: the ability to log and
summarize historical data, good-looking displays, and the ability to
monitor multiple clusters.
It might be useful to look at the design of BeoStat. It's superficially
similar to other systems out there, but we made decisions that are very
different from everyone else's -- ones that most people consider wrong
until they understand their value.
Some of them are:
 - It's not extensible
 - It reports values in a binary structure
 - It's UDP unicast to a single master machine
 - It has no liveness criteria
 - The receive side stores only current values
The first one is the most unusual. BeoStat is not extensible. You can't
add your own stat entries. You can't have it report stats from 64
cores. It reports what it reports... that's it.
Why is this important? We want to deploy cluster systems. Not build a
one-off cluster. We want the stats to be the same on every system we
deploy. We want every tool that uses the stats to be able to know that
they will be available. Once you allow and encourage a customizable
system, every deployment will be different. Tools won't work out of the
box, and there is a good chance that tools will require mutually
incompatible extensions.
Deploying a fixed-content stat system also enforces discipline. We
carefully considered what we needed to report, and how to report it. In
contrast, look at Ganglia's stats. Why did they choose the set they did?
Pretty clearly because the underlying kernel reported those values. What
do they mean? The XML DTD doesn't tell you. You have to look at the
source code. What do you use them for? They don't know, they'll figure
it out later.
The next question people ask is "but what if I have 8/16/64 cores? You
only have 2 [[ now 4 ]] CPU stat slots." The answer is similar to the
above -- what are you going to do with all of that data? You are going to
summarize it before using it, so we just summarize it on the reporting
side. We report that there are N CPUs, the overall load average, and a
summary of the CPU cores as groups (e.g. per socket). For network
adapters we report e.g. eth0, eth1, eth2 and "all the rest added
together".
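In sketch form, the sender-side summarizing amounts to something like the
following (hypothetical code with invented names; it also assumes cores
are split evenly across sockets):

    /* Collapse per-core busy percentages into per-socket averages, and
     * fold every NIC past the first three into one catch-all slot. */
    #define NSOCKETS  2
    #define STAT_NICS 4   /* eth0, eth1, eth2, "everything else" */

    void summarize_cpus(const float *core_busy, int ncores,
                        float socket_busy[NSOCKETS])
    {
        int per_socket = ncores / NSOCKETS;  /* assumes an even split */
        for (int s = 0; s < NSOCKETS; s++) {
            float sum = 0.0f;
            for (int c = 0; c < per_socket; c++)
                sum += core_busy[s * per_socket + c];
            socket_busy[s] = sum / per_socket;
        }
    }

    void summarize_nics(const unsigned long long *rx_bytes, int nnics,
                        unsigned long long out[STAT_NICS])
    {
        for (int i = 0; i < STAT_NICS; i++)
            out[i] = 0;
        for (int i = 0; i < nnics; i++)
            out[i < 3 ? i : 3] += rx_bytes[i];  /* eth3+ lumped together */
    }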
Once we chose a fixed set of stats, we had the ability to make it a
fixed-size report. It could be reported as binary values, with any
per-kernel-version variation handled on the sending side.
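A fixed report might look something like this packed struct. To be clear,
this is an illustrative guess at the shape, not the actual BeoStat wire
format:

    #include <stdint.h>

    /* Every field is a fixed-width type, so the size and offsets are
     * identical on every node and in every release. */
    struct stat_report {
        uint32_t version;         /* format version, bumped only on redesign */
        uint32_t node;            /* sender's node number */
        uint32_t time_sec;        /* sender's clock, for skew checks */
        uint16_t ncpus;
        uint16_t cpu_mhz;
        uint32_t loadavg_x100;    /* 1-minute load average * 100 */
        uint32_t mem_total_mb;
        uint32_t mem_free_mb;
        uint64_t net_rx_bytes[4]; /* eth0..eth2 + everything else */
        uint64_t net_tx_bytes[4];
    } __attribute__((packed));    /* well under one Ethernet frame */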
Having a small, limited-size report meant that it fit in a single network
packet. That makes the network load predictable and very scalable.
It gave us the opportunity to use UDP effectively for reporting, without
fragmenting into multiple frames. UDP means that we can switch to and
from multicast without changes, even in real time.
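The send side is then a single sendto() per reporting interval, reusing
the illustrative struct above; switching between unicast and multicast is
nothing more than a change of destination address (the port number here
is an assumption, not Scyld's):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send one fixed-size report.  'dest' may be the master's unicast
     * address or a multicast group -- the code is identical either way. */
    int send_report(const struct stat_report *r, const char *dest)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(5545);           /* assumed port */
        inet_pton(AF_INET, dest, &sa.sin_addr);
        ssize_t n = sendto(fd, r, sizeof(*r), 0,
                           (struct sockaddr *)&sa, sizeof(sa));
        close(fd);
        return n == (ssize_t)sizeof(*r) ? 0 : -1;
    }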
A fixed-size frame makes the receiving side simple as well. We just
receive the incoming network frame into memory. No parsing, no
translation, no interpretation. Overall the receiving process does only
trivial work. This is important when the receiver is the master, which
could end up with the heaviest workload if the system isn't carefully
designed. We've supported 1000+ machines for years, and are now designing
around 10K nodes.
We actually do a tiny bit more when storing a stat packet -- we add a
timestamp. We can use this to figure out the time skew between the master
and compute node, to verify network reliability/load, and to decide if
the node is live.
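Put together, the receive path is deliberately dumb -- copy the frame
into the node's slot and stamp it. A sketch, building on the struct above
(MAX_NODES and the slot layout are invented):

    #include <time.h>

    #define MAX_NODES 1024               /* invented table size */

    struct stat_slot {
        time_t recv_time;                /* master's clock at arrival */
        struct stat_report cur, prev;    /* the only two values we keep */
    };
    static struct stat_slot table[MAX_NODES];

    /* Called once per arriving frame: no parsing, no translation. */
    void store_report(const struct stat_report *r)
    {
        struct stat_slot *s = &table[r->node % MAX_NODES];
        s->prev = s->cur;                /* slide the last value down */
        s->cur  = *r;
        s->recv_time = time(NULL);       /* skew = recv_time - r->time_sec */
    }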
This isn't the only liveness test. It's not even the primary liveness
test. We document it as only a guideline. Developers should use the
underlying cluster management system to decide if a node has died. But if
there hasn't been a recent report, a scheduler should avoid using the node.
Classifying the world into Live and Dead is wrong. It's at least Live,
Dead, and Schrödinger's Still-boxed Cat.
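In code terms the guideline is a three-way classification, something like
this (the thresholds are assumptions, not our documented values):

    #include <time.h>

    enum node_state { NODE_LIVE, NODE_UNKNOWN, NODE_DEAD };

    /* Advisory only: the cluster management system is authoritative for
     * "dead".  A scheduler should merely avoid nodes that have gone quiet. */
    enum node_state classify(time_t last_report, time_t now)
    {
        double age = difftime(now, last_report);
        if (age < 10.0)       /* within a few reporting intervals */
            return NODE_LIVE;
        if (age < 60.0)       /* quiet: the still-boxed cat */
            return NODE_UNKNOWN;
        return NODE_DEAD;     /* only the cluster manager really knows */
    }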
Finally, this is a State, Status and Statistics system. It's a
scoreboard, not a history book. We keep only two values, the last two
received. That gives us the current info and the ability to calculate
rates. If any subsystem needs older values (very few do), it can pick a
logging, summarization and coalescing approach of its own.
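Rates fall straight out of the two retained samples. A sketch, using the
invented slot layout from above:

    /* Bytes/second on eth0 between a node's last two reports. */
    double rx_rate_eth0(const struct stat_slot *s)
    {
        double dt = (double)s->cur.time_sec - (double)s->prev.time_sec;
        if (dt <= 0.0)
            return 0.0;   /* duplicated or reordered UDP report */
        return (double)(s->cur.net_rx_bytes[0] - s->prev.net_rx_bytes[0]) / dt;
    }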
We made many other innovative architectural decisions when designing the
system, such as publishing the stats as a read-only shared-memory version.
But these are less interesting because no one disagrees with them ;-).
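For completeness, the shared-memory publication is plain POSIX shared
memory mapped without write permission; a consumer-side sketch (the
segment name is an assumption):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Tools map the stat table read-only, so no consumer can corrupt
     * the master's copy.  "/beostat" is an invented segment name. */
    const struct stat_slot *attach_stats(void)
    {
        int fd = shm_open("/beostat", O_RDONLY, 0);
        if (fd < 0)
            return NULL;
        void *p = mmap(NULL, sizeof(struct stat_slot) * MAX_NODES,
                       PROT_READ, MAP_SHARED, fd, 0);
        close(fd);    /* the mapping stays valid after close */
        return p == MAP_FAILED ? NULL : (const struct stat_slot *)p;
    }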
--
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com www.scyld.com
Annapolis MD and San Francisco CA