quick question

Tue Jun 20 14:03:39 PDT 2000

On Tue, 20 Jun 2000, Kragen Sitaker wrote:

> This is not correct.  There are several ways to partition problems in
> general, and log-processing problems in particular, and splitting up
> the input data is only one of them.
> 
> Some examples:  
> 
> - if you're running a pipelinable problem --- separable, sequential
>   stages, each with a relatively high computation-to-data ratio (say, a
>   billion or more instructions for every twelve megabytes, thus a
>   thousand instructions for every twelve bytes or so) --- you can build
>   a pipeline with different stages on different machines.  In an ideal
>   world, you'd be able to migrate pipeline stages between machines to
>   load-balance.
> - if you want to generate ten reports for ten different web sites whose
>   logs are interleaved in the same log file, you can run the log into
>   one guy whose job it is to divvy it up, line by line, among ten
>   machines doing analysis, one for each web site.
> - if you're looking for several different kinds of information in the
>   log file --- again, with a high computation-to-data ratio --- you can
>   send a copy of the log file to several processes, each extracting one
>   of the kinds of information.
> 
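The second case above (one process divvying up an interleaved log, line
by line, among per-site analyzers) can be sketched roughly as follows.
This is only a minimal illustration, not anyone's actual setup: the
names site_of and divvy are hypothetical, and it assumes the site name
is the first whitespace-separated field of each line, which may not
match the real log format.  In a real cluster each bucket would be a
pipe or socket to another machine rather than an in-memory list.

```python
from collections import defaultdict


def site_of(line):
    """Return the site key for one log line.

    Assumption: the virtual host name is the first whitespace-separated
    field. Replace this with whatever matches the real log format.
    """
    fields = line.split(None, 1)
    return fields[0] if fields else ""


def divvy(lines, key=site_of):
    """Route each non-blank line to the bucket for its site.

    Order is preserved within each site. The buckets here are plain
    lists standing in for connections to the analysis machines.
    """
    buckets = defaultdict(list)
    for line in lines:
        if line.strip():  # ignore blank lines
            buckets[key(line)].append(line)
    return dict(buckets)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "a.example.com 10.0.0.1 GET /index.html 200",
        "b.example.com 10.0.0.2 GET /about.html 200",
        "a.example.com 10.0.0.3 GET /missing 404",
    ]
    for site, site_lines in divvy(sample).items():
        print(site, len(site_lines))
```

The splitter does almost no work per line, so it only pays off when the
per-site analysis is much more expensive than the split itself.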

All good points.  Another is that if the reports are built from syslogd
output, a sensible /etc/syslog.conf can often do a lot of the
partitioning for you.  If they are built from a centralized syslog
loghost that receives the syslog output of (say) 100+ hosts, you might
look into "syslog-ng", which filters input as it arrives at the loghost
and squirrels it away in a set of host/loglevel-specific files
according to your specification.

Either of these will leave you with significantly smaller files to
process, and a lot of the processing will already be done.
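
As a rough illustration of the first option, a classic syslog.conf can
split messages by facility so that each report only has to read its own
file.  The file names below are assumptions chosen for the example, not
a recommendation, and the set of facilities worth splitting out depends
entirely on what the hosts actually log.

```
# /etc/syslog.conf sketch: each rule is "selector  action".
# Some older syslogd implementations require tabs, not spaces,
# between the two fields.

# Mail subsystem messages go to their own file.
mail.*                                      /var/log/maillog

# Authentication messages, kept apart from everything else.
authpriv.*                                  /var/log/secure

# Kernel messages.
kern.*                                      /var/log/kernlog

# Everything else at level info or above, minus the facilities
# already split out above.
*.info;mail.none;authpriv.none;kern.none    /var/log/messages
```

syslogd has to be told to reread its configuration (typically with a
SIGHUP) before changes like these take effect.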

> Of course, all of this depends on the problem.  My guess is that the
> original querent can, as you suggested, rewrite his log-processing
> script in C instead of Perl and get the performance boost he needs, and
> it will be easier than parallelizing by anything but the simplistic
> split-the-log-into-chunks approach.
> 
> [I'm just guessing that the log-processing code is currently in Perl. :) ]

Agreed and agreed.

    rgb

> -- 
> <kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
> The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
> <URL:http://www.pobox.com/~kragen/bubble.html>
> The power didn't go out on 2000-01-01 either.  :)
> 
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu