dealing with lots of sockets (was Re: [Beowulf] automount on high ports)

Robert G. Brown rgb at phy.duke.edu
Wed Jul 2 16:44:58 PDT 2008


On Wed, 2 Jul 2008, Perry E. Metzger wrote:

>
> "Robert G. Brown" <rgb at phy.duke.edu> writes:
>>> Well, it actually kind of is. Typically, a box in an HPC cluster is
>>> running stuff that's compute bound and whose primary job isn't serving
>>> vast numbers of teeny high latency requests. That's much more what a
>>> web server does. However...
>>
>> I'd have to disagree.  On some clusters, that is quite true.  On others,
>> it is very much not true, and whole markets of specialized network
>> hardware that can manage vast numbers of teeny communications requests
>> with acceptably low latency have come into being.  And in between, there
>> is, well, between, and TCP/IP at gigabit speeds is at least a contender
>> for ways to fill it.
>
> I have to admit my experience here is limited. I'll take your word for
> it that there are systems where huge numbers of small, high latency
> requests are processed. (I thought that teeny stuff in HPC land was
> almost always where you brought in the low latency fabric and used
> specialized protocols, but...)

Not so much high latency, but there are many different messaging
patterns.  Some are BW dominated with large messages, some consist of
many small, latency dominated messages and use specialized networks,
and some are in between -- medium sized messages in medium sized
amounts.  YMMV is a standard watchword.  Some people do fine with
TCP/IP over ethernet for whatever message size.  I'm not quite sure
what you mean by "vast numbers of teeny high latency requests", so I'm
not sure if we really are disagreeing or agreeing in different words.
If you mean that the problem of HA computing on a human timescale is
different from the typical HPC problem, then we agree very much, but
then I don't see the point in the context of the current discussion.

> Sure, but it is way inefficient. Every single process you fork means
> another data segment, another stack segment, which means lots of
> memory. Every process you fork also means that concurrency is achieved
> only by context switching, which means loads of expense on changing
> MMU state and more. Even thread switching is orders of magnitude worse
> than a procedure call. Invoking an event is essentially just a
> procedure call, so that wins big time.

Sure, but for a lot of applications one doesn't have a single server
with umpty zillion connections -- which may be what you mean by your
"high latency teensy message" point above.  If the connection is
persistent, the overhead associated with task switching is just part of
the normal multitasking of the OS.  In cluster computing, one may have
only a small set of these connections to any particular host, or one
may have lots -- many-to-many communications, master-slave
communications.  Similarly, many daemon-driven tasks tend to be quite
bounded.  If a server's load average is down under 0.1 nearly all the
time, nobody cares; if the overhead of communication in a parallel
application is down in the sub-1% range, people don't care much.  But
then, few cluster applications are built on forking daemons...;-)
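For concreteness, the classic forking daemon pattern is just the
following (a minimal, untested sketch in C; error handling and the
actual service are waved away, and the port number is arbitrary):

  /* Minimal sketch of a fork-per-connection daemon.  Untested
   * illustration; a real daemon needs error handling, logging,
   * daemonization, etc. */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static void serve_connection(int fd)
  {
      char buf[4096];
      ssize_t n;
      while ((n = read(fd, buf, sizeof(buf))) > 0)
          write(fd, buf, n);          /* trivial echo "service" */
  }

  int main(void)
  {
      struct sockaddr_in addr;
      int listener = socket(AF_INET, SOCK_STREAM, 0);

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(7777);    /* arbitrary example port */
      bind(listener, (struct sockaddr *) &addr, sizeof(addr));
      listen(listener, 16);
      signal(SIGCHLD, SIG_IGN);       /* let the kernel reap children */

      for (;;) {
          int conn = accept(listener, NULL, NULL);
          if (conn < 0)
              continue;
          if (fork() == 0) {          /* child: one process per connection */
              close(listener);
              serve_connection(conn);
              exit(0);
          }
          close(conn);                /* parent: straight back to accept() */
      }
  }

The kernel does the scheduling and the isolation here; a hung or
crashed child takes only its own connection down with it.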

Still, it is important to understand why there are a lot of
applications that are.  In the old days there were limits on how many
processes, open connections, open files, and nearly any other related
thing you could have at the same time, because memory was limited.
Kernel resources (if nothing else) have to be allocated for each one,
and the kernel overhead associated with all of the connections, files,
etc. could scale up to where it more or less shut down a system.
Nowadays, with my LAPTOP having 4 GB, multiple cores, and far more
scalable MP kernels, the limits are a lot more flexible.  It may well
be better to maintain many persistent connections within a single
application, making it essentially an extension of the kernel: the
kernel manages the "multitasking" overhead of message reception per
connection, and one avoids the additional multitasking associated with
farming the information out per connection to a forked copy of a server
process.  As I said, very interesting and a good idea -- I'm learning
from you -- but a good idea for certain applications, possibly more
trouble than it's worth for others?

Or maybe not.  If you make writing event driven network code as easy,
and as well documented, as writing standard socket code and standard
daemon code, the forking daemon may become obsolete.  Maybe it IS
obsolete.
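
For contrast, here is an equally minimal (and equally untested) sketch
of the event driven version of the same daemon -- one process, a
select() loop over a set of persistent connections, no fork() at all:

  /* Minimal single-process event loop using select(): one process,
   * many persistent connections.  Untested illustration; error
   * handling mostly omitted. */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/select.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      struct sockaddr_in addr;
      int listener = socket(AF_INET, SOCK_STREAM, 0);
      fd_set conns;
      int maxfd, fd;

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(7777);        /* arbitrary example port */
      bind(listener, (struct sockaddr *) &addr, sizeof(addr));
      listen(listener, 16);

      FD_ZERO(&conns);
      FD_SET(listener, &conns);
      maxfd = listener;

      for (;;) {
          fd_set readable = conns;        /* select() modifies its argument */
          if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
              continue;
          for (fd = 0; fd <= maxfd; fd++) {
              if (!FD_ISSET(fd, &readable))
                  continue;
              if (fd == listener) {       /* new connection: just remember it */
                  int conn = accept(listener, NULL, NULL);
                  if (conn >= 0) {
                      FD_SET(conn, &conns);
                      if (conn > maxfd)
                          maxfd = conn;
                  }
              } else {                    /* event on an existing connection */
                  char buf[4096];
                  ssize_t n = read(fd, buf, sizeof(buf));
                  if (n <= 0) {           /* closed or error: drop it */
                      close(fd);
                      FD_CLR(fd, &conns);
                  } else {
                      write(fd, buf, n);  /* trivial echo "service" */
                  }
              }
          }
      }
  }

This shape is fine up to FD_SETSIZE descriptors (typically 1024); past
that one reaches for poll() or epoll, but the structure of the loop is
the same.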

So, what do you think?  Should one "never" write a forking daemon, or
use inetd?  [Incidentally, does this mean that you are similarly
negative about forking applications in general, since similar resource
constraints apply to ALL forks, right?]  Or should one use event driven
servers only for big servers with no particular hurry on returning
messages for any given connection?  I'm guessing that when writing such
a server, one has to do some of the work that the kernel would do for
you for forked processes -- ensure that no connection is starved for
timeslices or network slices, manage priorities if necessary, smoothly
multitask any underlying computation associated with providing the data.
After all, the MOST efficient server is one with the server code built
into the kernel -- DOS plus an application, as it were.  Why bother with
the overhead of a general purpose multitasking operating system when you
can handle all the multitasking natively within your one monolithic
application?  Ditto networking -- why replicate general purpose features
of the network stack in the kernel and network structs when you'll never
need them for your ONE application?

Usually one trades off the ease of programming and use in a general
purpose environment against some penalty, as general purpose
environments require more state information and overhead to maintain and
operate.  So are you arguing that there are no tradeoffs, and one should
"always" write server network code (or code in a suitably segmented
application) on an event model, or that it is a better model for some
class of client applications, some pattern of use?

This still is (I think) OT, as master-slave parallel applications are
fairly common, with a toplevel master doling out units of work to the
slaves and then collecting the results.  I think that it is probably
more usual to write the code for this as a non-forking application
anyway, but I can still imagine exceptions.  IIRC, some of these things
are the motivation for e.g. Scyld and bproc.  If anyone else on list is
bored with this, let me know and we can take it offline.

> Event driven systems can also avoid locking if you keep global data
> structures to a minimum, in a way you really can't manage well with
> threaded systems. That makes it easier to write correct code.
>
> The price you pay is that you have to think in terms of events, and
> few programmers have been trained that way.

What do you mean by events?  Things picked out with a select() call,
e.g. I/O waiting to happen on a file descriptor?  Signals?  I think the
bigger problem is that a lot of the events in question are basically
(fundamentally) kernel interrupts -- I/O being driven by one or more
asynchronous processes -- and you're right, a lot of programmers never
learn to manage this because it is actually pretty difficult.  One has
to handle blocking vs. non-blocking issues, raw I/O (in many cases),
and a scheduler of sorts to ensure that connections aren't starved
(unless you are content to process events in strict FIFO order, letting
an event piggy or a buggy/crashed process hang the entire pending
queue).
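
For instance (another untested sketch, nothing more), marking
descriptors non-blocking and bounding the work done per connection per
pass through the loop is exactly the kind of bookkeeping one ends up
writing by hand:

  /* Sketch: non-blocking descriptors plus a bounded amount of work per
   * connection per pass through the event loop, so that one busy
   * connection can't starve the others.  Untested illustration. */
  #include <fcntl.h>
  #include <unistd.h>

  /* Put a descriptor into non-blocking mode so a slow peer can't stall
   * the whole event loop inside read() or write(). */
  int set_nonblocking(int fd)
  {
      int flags = fcntl(fd, F_GETFL, 0);
      if (flags < 0)
          return -1;
      return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
  }

  /* Service at most one buffer's worth of data, then return to the
   * loop so other ready connections get a turn (trivial echo again).
   * A read() of -1 with errno == EAGAIN just means "no data yet"; a
   * real server also has to track partially written replies here. */
  #define MAX_CHUNK 4096
  void service_one_pass(int fd)
  {
      char buf[MAX_CHUNK];
      ssize_t n = read(fd, buf, sizeof(buf));
      if (n > 0)
          write(fd, buf, n);
  }

None of which is rocket science, but all of it is work the kernel would
otherwise be doing for you.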

Forking provides you with a certain amount of "automatic" robust
parallelism: if a forked connection crashes, it crashes just one
connection, not the server or any of the rest of the existing
connections.  Without it, one has to make the code a lot more robust.
The kernel DOES do a lot of things for you on a forked process that you
have to do for yourself in event driven code, and it isn't exactly
trivial to provide it either well, efficiently, or robustly (where the
kernel is perforce all three, within the limits imposed by its general
purpose design).

As I said, people wrote lots of applications on UDP because they thought
"hmmm, I don't need ALL the overhead associated with making TCP robust,
I'll use lightweight UDP instead and write my own packet sequencer, my
own retransmit, etc."  Then they discovered that by the time they ended
up with something that was reliable, they hadn't really saved much -- or
may well have ended up with something even slower than TCP.  People
work(ed) HARD on making TCP fairly efficient and making it handle edge
cases.  Doing it on your own is unlikely to match either one, unless you
are an uberprogrammer.  You sound like you probably are, but I'm not
sure everyone is...;-)

I'm not arguing, mind you -- I already believe that writing an event
driven server (or client, or both in a more symmetric model) makes sense
for a certain class of applications, including many/most of the ones
relevant to cluster computing.  I'm asking whether one should NEVER
write a forking daemon -- because the libraries you mention above
provide schedulers and can manage dropped connections or hung
resources, or because you think that the programmer should always be
able to add them as needed -- or whether there is a problem scale and
server type for which it makes sense, and others for which it is
overkill or for which the services provided by the kernel for forked
processes (or threads, a rose by any other name...) are worth their
cost.  An event driven application
IS basically a kernel, in a manner of speaking.  Should every daemon be
a kernel, or can some use the existing kernel for kernel-like
functionality and focus on just provisioning a single connection well?

    rgb

>
> Perry
>

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


