quick question

Tue Jun 20 19:56:34 PDT 2000

Bradley Alexander writeth:
> Let's take this a step further. I was, a long time ago, looking at an
> application to parallelize the SHADOW IDS' analysis station. Rather than
> simply running a separate analyzer on each node, I thought that parallelizing
> the process would actually be more efficient. I thought that it could handle a
> number of separate sensors etc. (I should note that SHADOW is a client/server
> setup in the form of sensors that use tcpdump to capture traffic, and an
> analysis station that analyzes these tcpdump files.)
> 
> Unfortunately other duties have kept me from pursuing this as yet, but one of
> the problems I found that I had was getting the logs (9+GB/hour) back to the
> cluster at anything resembling reasonable time, especially since it would have
> to be an out-of-band transfer to keep from choking the network it's supposed to
> be watching. ("Your IDS just caused a DoS, so GET OUT." :-)

There may be other ways to do this.

Suppose we have a set of categories C1, C2, C3, etc., and a set of
sensors S1, S2, S3, etc., producing a set of events E1, E2, E3, etc.
Each event is produced by exactly one sensor and belongs to some set of
categories, say Cn, Cm, Cp.

Now, suppose you have a set of machines M1, M2, M3, etc., each of which
is devoted to analyzing one category of events: M1 analyzes events in
C1, M2 analyzes events in C2; in general Mn analyzes events in Cn.
Then, when a sensor produces an event, instead of sending it to a
central choke point, it determines which categories (Cn, Cm, Cp) it
belongs to, and sends it to the appropriate machines Mn, Mm, Mp.
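
Here is a rough sketch of that routing step in Python, just to make the
idea concrete.  Everything in it is made up for illustration: the
category rules in categorize(), the ANALYZERS table, and the event
fields (proto, dport, flags) are assumptions, not anything SHADOW
actually defines, and dispatch() only groups events per destination
where a real sensor would send them over the network.

```python
from collections import defaultdict

# Hypothetical table: which analysis machine handles which category.
ANALYZERS = {"C1": "M1", "C2": "M2", "C3": "M3"}


def categorize(event):
    """Hypothetical classifier: return the set of categories for an event."""
    cats = set()
    if event.get("proto") == "tcp":
        cats.add("C1")
    if "dport" in event and event["dport"] < 1024:
        cats.add("C2")
    if event.get("flags") == "SYN":
        cats.add("C3")
    return cats


def route(event):
    """Return the analysis machines that should receive this event."""
    return sorted(ANALYZERS[c] for c in categorize(event))


def dispatch(events):
    """Group events by destination machine (stand-in for a network send)."""
    outbox = defaultdict(list)
    for event in events:
        for machine in route(event):
            outbox[machine].append(event)
    return dict(outbox)
```

The point is that the sensor makes the routing decision locally, so an
event in no interesting category never leaves the sensor at all, and an
event in three categories is sent three times.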

This way, no machine receives more traffic than belongs in a single
category; you might have 20 megabits per second of aggregate event
traffic --- 9GB/hour --- but each analysis machine will only have a
fraction of that level of traffic flowing into it.  If your network
aggregate bandwidth is much bigger than 20 megabits per second --- say,
you have a 36-port 100BaseT switch with a 7.2 gigabit-per-second
backplane --- you have SOLVED THIS PROBLEM.
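
The arithmetic behind those numbers, as a small Python sketch.  It
assumes decimal units (1 GB = 10^9 bytes), and per_machine_load()
additionally assumes the traffic splits evenly across the analysis
machines, which real categories probably won't; the function names are
mine.  The switch figure assumes 7.2 gigabits means every port running
full duplex.

```python
def gb_per_hour_to_mbit_per_s(gb_per_hour):
    """Convert gigabytes per hour to megabits per second (decimal units)."""
    bits_per_hour = gb_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6


def per_machine_load(aggregate_mbit, machines, avg_categories_per_event=1.0):
    """Average inbound Mbit/s per analysis machine, assuming an even split.

    avg_categories_per_event accounts for events that are sent to more
    than one machine because they belong to several categories.
    """
    return aggregate_mbit * avg_categories_per_event / machines


def switch_backplane_gbit(ports, port_mbit=100, full_duplex=True):
    """Aggregate switch bandwidth in Gbit/s if every port runs at line rate."""
    directions = 2 if full_duplex else 1
    return ports * port_mbit * directions / 1000
```

So gb_per_hour_to_mbit_per_s(9) comes to 20 Mbit/s, ten analysis
machines with each event landing in 1.5 categories on average would see
about 3 Mbit/s apiece, and switch_backplane_gbit(36) gives the 7.2
figure.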

As a further refinement, M1, M2, M3, etc., can be the same machines
that run the sensors; this prevents you from having to buy and admin a
separate analysis cluster.  You can actually have the Ms be "virtual
machines" that move dynamically from one sensor machine to another for
load-balancing.
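
One simple way to decide where the "virtual" analyzers should live is
sketched below in Python: a greedy placement that puts the heaviest
category on the currently least-loaded sensor machine.  This is only
one possible policy, not something from SHADOW; the load numbers and
the place_analyzers() name are hypothetical, and actually migrating an
analyzer between hosts is left out entirely.

```python
import heapq


def place_analyzers(category_load, hosts):
    """Greedily assign each category's analyzer to the least-loaded host.

    category_load maps category name -> estimated load (any consistent unit).
    Returns a dict mapping category name -> host name.
    """
    # Min-heap of (total load assigned so far, host name).
    heap = [(0.0, host) for host in hosts]
    heapq.heapify(heap)
    placement = {}
    # Place the heaviest categories first so the light ones fill the gaps.
    for cat, load in sorted(category_load.items(), key=lambda kv: -kv[1]):
        total, host = heapq.heappop(heap)
        placement[cat] = host
        heapq.heappush(heap, (total + load, host))
    return placement
```

Rerunning this periodically with fresh load estimates, and moving only
the analyzers whose host changed, would give a crude form of the
dynamic load-balancing described above.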

> > [I'm just guessing that the log-processing code is currently in Perl. :) ]
> 
> Since most of SHADOW is written in Perl, isn't there a parallelized Perl module?

Not that I know of.  Profile SHADOW and see if you can double its speed
by rewriting the hottest 4% of it in C, making that part 10-200 times
faster :)
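
Whether that works out depends on how much of the run time the
rewritten code accounts for, which is what the profile tells you.
Amdahl's law makes this concrete; here is a small Python sketch (the
function names are mine, and the inputs are whatever your profiler
reports, not measurements of SHADOW).

```python
def overall_speedup(runtime_fraction, local_speedup):
    """Amdahl's law: whole-program speedup when a part that accounts for
    runtime_fraction of the run time is made local_speedup times faster."""
    return 1.0 / ((1.0 - runtime_fraction) + runtime_fraction / local_speedup)


def required_fraction(target_speedup, local_speedup):
    """Fraction of run time the rewritten part must account for in order
    to reach target_speedup overall (Amdahl's law solved for the fraction)."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / local_speedup)
```

For example, required_fraction(2, 10) is about 0.56 and
required_fraction(2, 200) is about 0.50, so doubling the overall speed
needs the rewritten code to be responsible for a bit over half of the
run time.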

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)