[Beowulf] Re: dealing with lots of sockets

Robert G. Brown rgb at phy.duke.edu
Thu Jul 3 09:44:41 PDT 2008


On Wed, 2 Jul 2008, Perry E. Metzger wrote:

> Seeing is believing. There are lots of good papers out there on
> concurrency strategies for systems with vast numbers of sockets to
> manage, and there is no doubt what the answer is -- threads suck
> compared to events, full stop. Event systems scale linearly for far
> longer.

Sure, but:

> It depends. If you're doing something where there is going to be one
> socket talking to the system a tiny percentage of the time, why would
> you bother building an event driven server? If you're building

Or many sockets, but with a task granularity that makes your possibly
megaclock/millisecond task switching overhead irrelevant.  I'd have to
rebuild and rerun lmbench to get an accurate measure of the current
context switch time, but milliseconds seems far too long.  That sounds
like the inverse of the timeslice, not the actual CS time, which I'm
pretty sure has been MICROseconds, not milliseconds, since back in the
2.0 kernel on e.g. 200 MHz hardware.  My laptop is clocking over 2000
CS's/second sitting nearly idle -- that's just the "noise" of the
system's normal interactive single user function, and this when its
clock is at its idling 800 MHz (out of 2.5 GHz).  On the physics
department home directory (NFS) server I'm clocking only 7000-7500
CS/sec at a load average of 0.3 (dual core opteron at 2.8 GHz).  Since
nfsstat doesn't seem to do rates yet (and has int counters instead of
long long uints, grrr) it is a bit difficult to see exactly what this is
derived from in real time as far as load goes, but almost all of it
seems to be interrupt processing, since the interrupt count and the
context switch count rise in close parallel.
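
(Rates like these are easy to watch: the kernel keeps running totals of
interrupts and context switches in /proc/stat -- vmstat's "in" and "cs"
columns are derived from them -- so something like the following crude
sketch in C is all it takes.  A sketch only; the parsing is
deliberately naive.)

  /* csrate.c -- sample the kernel's running interrupt and context
   * switch counters from /proc/stat once a second and print the
   * per-second rates (the same numbers vmstat reports as "in"/"cs").
   */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static void read_counters(unsigned long long *intr,
                            unsigned long long *ctxt)
  {
      char line[8192];
      FILE *fp = fopen("/proc/stat", "r");

      if (!fp)
          return;
      while (fgets(line, sizeof(line), fp)) {
          if (strncmp(line, "intr ", 5) == 0)
              sscanf(line + 5, "%llu", intr);  /* first field = total */
          else if (strncmp(line, "ctxt ", 5) == 0)
              sscanf(line + 5, "%llu", ctxt);
      }
      fclose(fp);
  }

  int main(void)
  {
      unsigned long long i0, c0, i1, c1;

      for (;;) {
          i0 = c0 = i1 = c1 = 0;
          read_counters(&i0, &c0);
          sleep(1);
          read_counters(&i1, &c1);
          printf("intr/sec %llu  ctxt/sec %llu\n", i1 - i0, c1 - c0);
      }
      return 0;
  }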

Now, I'm trying to understand the advantage you describe, so bear with
me.  See what you think of the following:

The kernel processes interrupts as efficiently as possible, with upper
half and lower half handlers, but either one requires that the CPU stop
userspace tasks and load up the kernel's interrupt handler, which
requires moving in and out of kernel mode.  We aren't going to quibble
about a factor of two here; I think this is well within a factor of two
of a context switch time, since most of the CPU's state still has to be
saved, so I'm calling it a CS even if it is maybe 80% of a CS as far as
time goes, depending on how badly one is thrashing the caches.

Network based requests are associated with packets from different
sources; packets require interrupts to process, interrupts from distinct
sources require context switches to process (do they not? -- I'm not
really sure here but I recall seeing context switch counts generally
rise roughly linearly with interrupt rates even on single threaded
network apps) so I would EXPECT the context switch load imposed by a
network app to be within a LINEAR factor of two to four, independent of
whether it was run as a fork or via events, with one exception described
below.

I'm estimating the load as: packet arrives (I,CS), app gets the CPU
(CS), select on the FD triggers its "second stage interrupt/packet
handler" which computes a result and writes to the network (I,CS), the
process blocks on select again, kernel runs the next task (CS).  So I
count something like two interrupts and four context switches per
network transaction for a single single-threaded network server
application handling a single-packet request with a single-packet
reply.  If it
has to go to disk, add at least one interrupt/context switch.  So it is
four or five context switches, some of which may be lighter weight than
others, and presumes that the process handling the connection already
exists.
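
To make that accounting concrete, the kind of minimal single-threaded
transaction I have in mind looks like this, with my guesses about where
the interrupts (I) and context switches (CS) land marked in comments.
A sketch only -- serve_one() is hypothetical and assumes the connection
is already open:

  #include <sys/select.h>
  #include <unistd.h>

  void serve_one(int fd)       /* fd: an already-connected socket */
  {
      char req[1024];
      ssize_t n;
      fd_set rfds;

      FD_ZERO(&rfds);
      FD_SET(fd, &rfds);

      /* We are blocked here.  The request packet arrives: (I,CS) into
       * the kernel, then (CS) as the scheduler hands us the CPU. */
      select(fd + 1, &rfds, NULL, NULL, NULL);

      n = read(fd, req, sizeof(req));    /* data is already buffered */
      if (n <= 0)
          return;

      /* ... compute the reply from req; here we just echo it back.
       * Writing it generates the outbound packet: (I,CS). */
      write(fd, req, (size_t) n);

      /* Falling back into select() blocks us again and the kernel
       * runs the next task: (CS). */
  }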

If the process was created by a fork or as a thread, there is technical
stuff about whether it is a kernel thread (expensive) or user thread
(much lighter weight, in fact, much more like a procedure call since the
forked processes share a memory space and presumably do NOT need to
actually save state when switching between them).  They still might
require a large chunk of a CS time to switch between them because I
don't know how the kernel task switcher manages select statements on the
entire cluster of children associated with the toplevel parent -- if it
sweeps through them without changing context there is one level of
overhead, if it fully changes context there is another, but again I
think we're talking 10-20% differences.

Now, if it is a SINGLE process with umpty open file descriptors and an
event loop that uses SOME system call -- and I don't quite understand
how a userspace library can do anything but use the same system calls I
could use in my own code to check for data to be read on an FD,
e.g. select or some sort of non-blocking I/O poll -- each preexisting,
persistent connection (open FD) requires at least 2-3 of the I/CS pairs
in order to handle a request.  In other words, it saves the CS
associated with switching the toplevel task (and permits the kernel to
allocate timeslices to its task list with a larger granularity, saving
the cost of going in/out of kernel mode to initiate the task/context
switch).  This saves around two CS out of four or five, a factor of
around two improvement.  Not that exciting, really.
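
And the event-driven version I'm imagining is nothing more exotic than
the same select(), just taken over the whole table of persistent FDs in
one process.  A sketch, where fds[], nfds and handle_request() stand in
for whatever the application actually maintains:

  #include <sys/select.h>

  extern int fds[];      /* open persistent connections (placeholder) */
  extern int nfds;       /* how many of them there are */
  extern void handle_request(int fd);   /* "second stage" handler */

  void event_loop(void)
  {
      fd_set rfds;
      int i, maxfd;

      for (;;) {
          FD_ZERO(&rfds);
          maxfd = -1;
          for (i = 0; i < nfds; i++) {
              FD_SET(fds[i], &rfds);
              if (fds[i] > maxfd)
                  maxfd = fds[i];
          }

          /* One blocking call covers every connection; the process
           * wakes (one CS) only when some FD has data waiting. */
          if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
              continue;

          for (i = 0; i < nfds; i++)
              if (FD_ISSET(fds[i], &rfds))
                  handle_request(fds[i]);
      }
  }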

The place where you get really nailed by forks or threads is in their
creation.  Creating a process is very expensive, a couple of orders of
magnitude more expensive than switching processes, and yeah, even order
ms.  So if you are handling NON-persistent connections -- typical
webserver behavior: make a connection, send a request for a file,
receive the file requested, break the connection -- handling this with a
unique fork or thread per request is absolute suicide.  So there a
sensible strategy is to pre-initiate enough threads to handle the
incoming request stream round-robin: each thread takes a request,
processes it, and resumes listening for the next request.  This still
requires the overhead of creating an FD (inevitable for each
connection), dealing with the interrupt/CSs required to process the
request and deliver the data, then closing/freeing the FD.  If the
number of daemons
required to process connections at incoming saturation is small enough
that the overhead associated with processing the task queue doesn't get
out of hand, this should scale very nearly as well as event processing,
especially if the daemons are a common fork and share an address space.
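
By "pre-initiate enough threads" I mean the classic prefork shape,
whether the workers are threads or a common fork.  In forked form it is
roughly the following sketch (error handling omitted; NWORKERS, the
port, and handle_connection() are placeholders):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define NWORKERS 16                    /* placeholder pool size */

  extern void handle_connection(int fd); /* application-specific */

  int main(void)
  {
      struct sockaddr_in addr;
      int listenfd = socket(AF_INET, SOCK_STREAM, 0);
      int i, connfd;

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(8080);       /* placeholder port */
      bind(listenfd, (struct sockaddr *) &addr, sizeof(addr));
      listen(listenfd, 128);

      /* Pay the fork cost NWORKERS times up front, not per request. */
      for (i = 0; i < NWORKERS; i++)
          if (fork() == 0)
              for (;;) {                 /* child: one worker daemon */
                  connfd = accept(listenfd, NULL, NULL);
                  if (connfd < 0)
                      continue;
                  handle_connection(connfd);
                  close(connfd);         /* per-connection FD cost */
              }

      for (;;)
          pause();                       /* parent just sticks around */
      return 0;
  }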

The last question is just how efficiently the kernel processes many
blocked processes.  Here I don't know the answer, and before looking it
up I'll post the question here where probably, somebody does;-)

If the connections are PERSISTENT -- e.g. imap connections forked by a
mail server for mail clients that connect and stay connected for hours
or days -- then as a general rule there will be no I/O waiting on the
connections because humans type or click slowly and erratically, mostly
a Poissonian load distribution.  If the kernel has a way of flagging
applications that are blocked on a select on an FD without doing an
actual context switch into the application, the scheduler can rip
through all the blocked tasks without a CS per task, at an overhead rate
within a CS or two per ACTIVE task (one where there IS I/O waiting) of
the hyperefficient event-driven server that basically stays on CPU
except for when the CPU goes back to the kernel anyway to handle the
packet stream during interrupts and to advance the timer and so on.  I
don't KNOW if the kernel manages runnable or blocked at quite this level
-- it does seem that there are fields in the task process table that
flag it, though, so I'd guess that it does. It seems pretty natural to
skip blocked processes without an actual CS in and out just to determine
that they are blocked, since many processes spend a lot of time waiting
on I/O that can take ms to unblock (e.g. non-cached disk I/O).  So I'm
not certain that having a large number of idle, blocked processes
(waiting on I/O on a FD with a select, for example) is a problem with
context switches per se.

> something to serve files to 20,000 client machines over persistent TCP
> connections and the network interface is going to be saturated, hell
> yes, you should never use 20,000 threads for that, write the thing
> event driven or you'll die.

Here there are a couple of things.  One is that 20K processes MIGHT take
20K context switches just to see if they are blocked on I/O.  If they
do, then you are definitely dead.  20K processes also at the very least
require 20K entries in the kernel process table, and even looping over
them in the scheduler to check for an I/O flag with no CS is going to
start to take time and eat a rather large block of memory and maybe even
thrash the cache.  So I absolutely agree, 20K mostly-idle processes on a
running system -- even a multicore with lots of memory -- is a bad idea
even if they are NOT processing network requests.  Fortunately, this is
so obvious that I don't think anybody sane would ever try to do this.

Second, 20K NON-persistent connections on e.g. a webserver would be
absolute insanity, as it adds the thread creation/destruction overhead
to the cost of processing the single-message-per-connection interrupts.
It just wouldn't work, so people wouldn't do that.  IIRC there were a
few daemons that did that back in the 80's (when I was managing Suns)
and there were rules of thumb on running them.  "Don't" is the one I
recall, at least if one had more than a handful of hosts connecting.

Running 10-20 parallel daemons might work, and people do that -- httpd,
nfsd. Running an event driven server daemon (or parallel/network
application) would work, and people do that -- pvmd does that, I
believe.  Which one works the best? I'm perfectly happy to believe that
the event driven server could manage roughly twice as many make/break
single message connections as a pile of daemons, if the processes aren't
bottlenecked somewhere other than at interrupt/context switches.  If we
assume that a CS takes order of 1-10 usec on a modern system, and it
takes a SMALLER amount of time to do the processing associated with a
request, then you'll get the advantage.  If each request takes (say)
order of 100 usec to handle, then you'll be bottlenecked at less than
10,000 requests per second anyway, and I don't think that you'd see any
advantage at all, although this depends strongly on whether or not one
can block all the daemons somewhere OTHER than the network.
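
Just to put numbers on that:

  work per request:          ~100 usec
  ceiling per busy core:     1 sec / 100 usec = 10,000 requests/sec
  CS cost per request:       4-5 CS x 1-10 usec = roughly 5-50 usec

so the couple of CS an event loop saves amount to at most a few tens of
usec against 100 usec of actual work -- tens of percent, not orders of
magnitude, and only if nothing else is the bottleneck.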

The question then is -- what kind of traffic is e.g. an NFS server or a
mail server as opposed to a web server?  NFS service requires
(typically) at least an fstat per file, and may or may not require
physical disk access with millisecond scale latencies.  Caching reduces
this by orders of magnitude, but some patterns of access (especially
write access or a nasty mix of many small requests -- latency bound disk
accesses) don't necessarily cache well.  It is not at all clear, then,
that an event driven NFS server would ultimately scale out better than a
small pile of NFS daemons as the bottleneck could easily end up being
the disk, not the context switch or interrupt burden associated with the
network.  Mail servers ditto, as they too are basically file servers,
but ones for which caching is of little or no advantage.  Event driven
servers might get you the ability to support as much as a factor of two
more connections without dying, but it is more likely that other
bottlenecks would kill your performance at about the same number of
connections either way.

To bring the whole thing around OT again, a very reasonable question is
what kind of application one is likely to encounter in parallel
computing and which of the three models discussed (forking per
connection, forking a pile of daemons to handle connections round robin,
single server/daemon handling a table of FDs) is likely to be best.

I'd argue that in the case of parallel computing it is ALMOST completely
irrelevant -- all three would work well.  If one starts up a single e.g.
pvmd or lamd, which forks off connected parallel applications on
request, then typically a) there will only be roughly 1 such fork per
core per system, because the system will run maximally efficiently if it
can just keep memory streaming in and out of L1 and L2; b)
they will have a long lifetime, so the cost of the fork itself is
irrelevant -- a ms out of hours to days of computing; c) internally the
applications are already written to be event driven, in the sense that
they maintain their own tables of FDs and manage I/O either at the level
of the toplevel daemons (who then provide it as streams to the
applications) or within the applications themselves via library calls
and structures.  I THINK PVM is more the former model and MPI the
latter, but there are many MPIs.

For other associated cluster stuff -- a scheduler daemon, an information
daemon such as xmlsysd in wulfstat -- forking vs non-forking for
persistent connections (ones likely to last longer than minutes) is
likely to be irrelevantly different.  Again, pay the ms to create the
fork, pay 6 interrupt/context switches instead of 4 or 5 per requested
service with a marginal cost of maybe 10 usec, and unless one is
absolutely hammering the daemon and the work done by the daemon has
absolutely terrible granularity (so it is only DOING order of 10 or 100
usec of work per return) it is pretty ignorable, especially on a system
that is PRESUMABLY spending 99% of its time computing and the daemon is
basically handling out-of-band task monitoring or control services.

> It is all about the right tool for the job. Apps that are all about
> massive concurrent communication need events. Apps that are about very
> little concurrent communication probably don't need them.

Absolutely, but do they need libevent, or do they simply need to be
sensibly written to manage a table of fds and selects or nonblocking
polls?  I've grabbed the source for libevent and am looking through it,
but again, it seems to me that it is limited to using the system calls
the kernel provides to handle I/O on open FDs, and if so the main reason
to use a library rather than the calls directly is ease of coding, not
necessarily efficiency.  Usually the code would be more efficient if you
did the same thing(s) inline, would it not?
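
From a first pass through the source, the basic usage seems to boil
down to registering a callback per FD and handing control to the
library's loop, which picks select/poll/epoll/kqueue underneath -- that
is, exactly the calls one could make directly.  A sketch in the 1.x
style (untested, so treat the details as approximate; the echo is a
stand-in for real work):

  #include <event.h>
  #include <unistd.h>

  static void on_readable(int fd, short which, void *arg)
  {
      char buf[1024];
      ssize_t n;

      (void) which;
      (void) arg;
      n = read(fd, buf, sizeof(buf));
      if (n > 0)
          write(fd, buf, (size_t) n);    /* echo, for illustration */
      /* With EV_PERSIST the event stays registered; just return. */
  }

  int serve(int fd)                      /* fd: an open connection */
  {
      struct event ev;

      event_init();                      /* set up the event loop */
      event_set(&ev, fd, EV_READ | EV_PERSIST, on_readable, NULL);
      event_add(&ev, NULL);              /* no timeout */
      return event_dispatch();           /* loop, invoking callbacks */
  }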

The one thing I completely agree with is that one absolutely must remain
aware of the high cost of creating and destroying threads/processes.
Forking is expensive, and forking to handle a high-volume stream of
transient connections is dumb.  So dumb that it doesn't work, so nobody
does this, I think.  At least, not for long.

> More the former, not the latter. Event driven programming typically
> uses registered callbacks that are triggered by a central "Event Loop"
> when events happen. In such a system, one never blocks for anything --
> all activity is performed in callbacks, and one simply returns from a
> callback if one can't proceed further. The programming paradigm is
> quite alien to most people.

Fair enough, because most people don't write heavily parallel
applications (which includes applications with many parallel I/O
streams, not just HPC).  But people who do fairly quickly learn to work
out the scaling and overhead, do they not, at least "well enough" to
achieve some level of performance?  Otherwise the applications just fail
to work and people don't use them.  Evolution in action...;-)

This has been a very informative discussion so far, at least for me.
Even if my estimates above are all completely out of line and ignore
some key thing, all that means is I'll learn even more.

The one thing that I wish were written with some sort of internal
scheduler/kernel and event mechanism from the beginning is X.  It has
its own event loop, but event-driven callbacks all block -- there is no
internal task parallelism. It is a complete PITA to write an X
application that runs a thread continuously but doesn't block the
operation of the GUI -- one has to handle state information, use a
separate thread, or invert all sorts of things from the normal X
paradigm of click and callback.  That is, most X apps are written to be
serial and X itself is designed to support serial operation, but
INTERESTING X apps are parallel, where the UI-linked I/O channels have
to be processed "independently" within the X event loop while a separate
thread is doing a task loop of interesting work.  AFAIK, X only supports
its own internal event loop and has horrible kludges to get the illusion
of task parallelism unless one just forks a separate thread for the
running "work" process and establishes shared state stuctures and so on
so that the UI callbacks can safely affect work going on in the
work-loop thread without blocking it.
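
The least kludgy version of that workaround I know is the
thread-plus-pipe trick: run the work loop in its own pthread and have
it write a byte down a pipe whenever it has something for the GUI, so
the X side can watch the pipe's read end as just another input source
(via e.g. XtAppAddInput, or whatever hook one's toolkit provides).  A
sketch of the non-X half only, with the shared-state locking left as a
comment (link with -lpthread):

  #include <pthread.h>
  #include <sys/select.h>
  #include <unistd.h>

  static int wakeup[2];          /* [0] = read end, [1] = write end */

  static void *work_loop(void *arg)
  {
      (void) arg;
      for (;;) {
          /* ... do a chunk of the interesting work, updating shared
           * state under whatever locking the application uses ... */
          write(wakeup[1], "x", 1);      /* poke the GUI side */
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t worker;
      fd_set rfds;
      char c;

      pipe(wakeup);
      pthread_create(&worker, NULL, work_loop, NULL);

      for (;;) {                 /* stand-in for the X event loop */
          FD_ZERO(&rfds);
          FD_SET(wakeup[0], &rfds);
          select(wakeup[0] + 1, &rfds, NULL, NULL, NULL);
          read(wakeup[0], &c, 1);
          /* ... refresh the display from the shared state ... */
      }
      return 0;
  }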

> I'd read the libevent man page to get a vague introduction.

There doesn't seem to be one in the source tarball I downloaded.  Only
event.3 and evdns.3, neither of which are terribly informative.  In
fact, the documentation sucks.  There is more space on the website
devoted to pictures of good vs terrible scaling with/without libevent
than there is documentation of how it works or how to use it, and of
course it is difficult to know if the figures are straw men or fair
comparisons.  There are a few chunks of sample code in samples.  I'll
take a look and see what I can see when I have time.

I'm working on an X-based GUI-controlled application and do have a
forking daemon (xmlsysd) that so far seems to work fine at the level of
traffic it was designed for and is likely bottlenecked someplace other
than CSs long before they become a problem, but this conversation has
convinced me that I could rewrite at least the latter in a way that is
more efficient even if I do leave it forking per connection (or using
xinetd, as it usually does now:-).  It is a monitoring daemon, and is
fairly lightweight now because one doesn't want to spend resources
watching a cluster spend resources.  If I redesigned it along lines
suggested by the analysis above, I could permit it to manage many more
connections with one part of its work accomplished with roughly constant
overhead, where now the overhead associated with that work scales
linearly with the number of connections.

    rgb

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


