quick question

W Bauske wsb at paralleldata.com
Tue Jun 20 17:12:13 PDT 2000


Kragen Sitaker wrote:
> 
> W Bauske writes:
> > Also, do you control the source code that does the processing?
> > If not, then the only way to split the work is split the log into
> > chunks and run the log processing on each chunk. Then you have
> > the question of is the data partitionable such that you get the
> > same analysis when it's split.
> 
> This is not correct. 

We'll see; see my comments below.

> There are several ways to partition problems in
> general, and log-processing problems in particular, and splitting up
> the input data is only one of them.
> 
> Some examples:
> 
> - if you're running a pipelinable problem --- separable, sequential
>   stages, each with a relatively high computation-to-data ratio (say, a
>   billion or more instructions for every twelve megabytes, thus a
>   thousand instructions for every twelve bytes or so) --- you can build
>   a pipeline with different stages on different machines.  In an ideal
>   world, you'd be able to migrate pipeline stages between machines to
>   load-balance.

Pipelining is good if the processing stages are dependent on one another.
The original request is too vague to say whether it would work here,
though. One could always call the "chunk" the whole file and hand it to
separate programs on separate machines, depending on whether or not the
processing depends on previous steps, similar to your last example below.
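
In the dependent-stage case, a stage in such a pipeline is really just a
filter: it reads records on stdin, does its share of the work, and writes
its output on stdout, so the stages can be chained across machines with
rsh/ssh and pipes. A rough sketch of one such stage in C follows; the log
format and the choice of fields to pass downstream are made-up assumptions,
not anything the original poster described.

/* One stage of a machine-spanning pipeline: a plain filter that reads
 * raw log lines on stdin and passes downstream only the fields the
 * later stages are assumed to need.  Field meanings here are
 * illustrative assumptions, not a real log format. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[8192];

    while (fgets(line, sizeof line, stdin)) {
        char *host  = strtok(line, " \t\n");   /* field 1: client host */
        char *stamp = strtok(NULL, " \t\n");   /* field 2: timestamp   */
        if (host && stamp)
            printf("%s %s\n", host, stamp);    /* forward only these   */
    }
    return 0;
}

Chained along the lines of
    cat access.log | rsh node1 stage1 | rsh node2 stage2 > report
(hypothetical host and program names), each machine runs one stage and the
data streams through them.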

> - if you want to generate ten reports for ten different web sites whose
>   logs are interleaved in the same log file, you can run the log into
>   one guy whose job it is to divvy it up, line by line, among ten
>   machines doing analysis, one for each web site.

This is just chunking it in a special way. I didn't specify HOW to chunk
it; you assumed I meant a simple chunking. One can always split the data
up in many different ways, depending on the specific processing
requirements.
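
As one concrete way of splitting it, here's a rough C sketch of that
divvy-it-up front end: read the interleaved log on stdin and route each
line to one of ten per-site output files, which separate machines could
then analyze independently. It assumes the site name is the first
whitespace-delimited field on each line, and it hashes the name rather
than matching the ten known sites explicitly, so it only approximates
one-bucket-per-site.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 10

/* simple string hash over the first field of the line */
static unsigned hash(const char *s, size_t len)
{
    unsigned h = 5381;
    while (len--)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    FILE *out[NWORKERS];
    char name[32], line[8192];
    int i;

    for (i = 0; i < NWORKERS; i++) {
        sprintf(name, "chunk.%d", i);
        out[i] = fopen(name, "w");
        if (!out[i]) { perror(name); return 1; }
    }

    while (fgets(line, sizeof line, stdin)) {
        size_t keylen = strcspn(line, " \t");  /* first field = site key */
        fputs(line, out[hash(line, keylen) % NWORKERS]);
    }

    for (i = 0; i < NWORKERS; i++)
        fclose(out[i]);
    return 0;
}

The chunk.N files can then be shipped to (or read over NFS by) the
analysis machines.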

> - if you're looking for several different kinds of information in the
>   log file --- again, with a high computation-to-data ratio --- you can
>   send a copy of the log file to several processes, each extracting one
>   of the kinds of information.

Same problem as above: it's just another form of chunking. I was vague
about what I meant by chunking on purpose, figuring there would be more
questions.

> 
> Of course, all of this depends on the problem.  My guess is that the
> original querent can, as you suggested, rewrite his log-processing
> script in C instead of Perl and get the performance boost he needs, and
> it will be easier than parallelizing by anything but the simplistic
> split-the-log-into-chunks approach.
> 

You assumed the split method; I didn't specify an implementation.
Most likely the log is already partitioned in a simple time-dependent
manner so it can be processed offline; I doubt it's done in real time.
So, if one can tolerate time splitting already, then it is likely one can
partition the log into 12/6/4/3/2/1-hour (etc.) chunks and combine those
results to get a picture of what happened over the whole log time frame.
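
The combine step is cheap as long as the per-chunk analyses boil down to
additive counts (hits per URL, bytes per client host, and so on), which is
an assumption on my part about what the analysis produces. A rough C
sketch of the combiner, assuming each chunk job writes summary lines of
the form "<count> <key>" and the concatenated summaries are sorted by key
(sort -k2, say) before being fed in:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096], key[4096], prev[4096] = "";
    long count, total = 0;
    int have = 0;

    while (fgets(line, sizeof line, stdin)) {
        if (sscanf(line, "%ld %4095s", &count, key) != 2)
            continue;                        /* skip malformed lines */
        if (have && strcmp(key, prev) != 0) {
            printf("%ld %s\n", total, prev); /* flush previous key   */
            total = 0;
        }
        strcpy(prev, key);
        total += count;
        have = 1;
    }
    if (have)
        printf("%ld %s\n", total, prev);     /* flush the last key   */
    return 0;
}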

We agree on tuning. Spending 12 hours running Perl is not such a good
plan. Again, though, I was not specific on purpose: just tune it, whatever
that means for the specific problem. If one doesn't know how to "tune it",
one should describe the problem and ask for advice.
 

Wes