[Beowulf] Accelerator for data compressing

Vincent Diepeveen diep at xs4all.nl
Fri Oct 3 08:55:08 PDT 2008


The question is, Joe:

Why are you storing it uncompressed?

Vincent

On Oct 3, 2008, at 5:45 PM, Joe Landman wrote:

> Carsten Aulbert wrote:
>
>> If 7-zip can only compress data at a rate of less than, say, 5 MB/s
>> of input data, I can copy the data over uncompressed much faster,
>> regardless of how many unused cores I have in the system.  Exactly
>> for these cases I would like to use all available cores to compress
>> the data fast, in order to increase the throughput.
>
> This is fundamentally the issue.  If the compression time plus the
> transmit time for the compressed data is greater than the transmit
> time for the uncompressed data, then the compression may not be
> worth it.  Sure, if it is nothing but text files, you may get 60-80+%
> compression ratios.  But for the case of (non-pathological) binary
> data, it might be only a few percent.  So in this case, even if you
> could get a few percent delta from the compression, is it worth all
> the extra time you spend to get it?
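>
> As a back-of-envelope sketch (the link rate and ratios in the example
> are assumptions, not measurements): done serially, compression pays
> off only if S/C + r*S/B < S/B, where S is the data size, C the
> compressor throughput, B the link rate and r the compressed/original
> ratio.  That works out to C > B/(1-r):
>
> 	# required compressor throughput on an assumed 100 MB/s link
> 	awk 'BEGIN { B=100; n=split("0.3 0.5 0.9", r);
> 		for (i=1; i<=n; i++)
> 			printf "ratio %.1f -> compressor must exceed %.0f MB/s\n", r[i], B/(1-r[i]) }'
>
> So on a gigabit link even a 50% ratio would demand a ~200 MB/s
> compressor before the serial approach wins.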
>
> At the end of the day the question is how much lossless compression  
> can you do in a short enough time for it to be meaningful in terms  
> of transmitting the data?
>
>> Do I miss something vital?
>
> Nope.  You got it nailed.
>
> Several months ago, I tried moving about 600 GB of data from an old
> server to a JackRabbit.  The old server and the JackRabbit had a
> gigabit link between them.  We regularly saw 45 MB/s scp rates (one
> of the chips in the older server was a Broadcom).
>
> I tried this with and without compression.  With compression (simple
> gzip), the copy took something like 28 hours (a little more than a
> day).  Without compression, it was well under 10 hours.
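>
> Back-of-envelope on those numbers (treating 1 GB as 1000 MB for
> simplicity), the gzip'd copy works out to roughly 6 MB/s of input
> data (the same ballpark Carsten describes), while the raw wire time
> at 45 MB/s would be under 4 hours:
>
> 	awk 'BEGIN { printf "gzip path: ~%.0f MB/s input;  wire time at 45 MB/s: ~%.1f h\n",
> 		600*1000/(28*3600), 600*1000/(45*3600) }'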
>
> In this case, compression (gzip) was not worth it.  The commands I
> used for the test were
>
> uncompressed:
>
> 	cd /directory
> 	tar -cpf - ./ | ssh jackrabbit "cd /directory ; tar -xpvf - "
>
> compressed:
>
> 	cd /directory
> 	tar -czpf - ./ | ssh jackrabbit "cd /directory ; tar -xzpvf - "
>
> If you want to trade more CPU time for a (potentially) better ratio,
> use "j" (bzip2) rather than "z" (gzip) in the options.
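>
> Spelled out, that bzip2 variant would be:
>
> 	cd /directory
> 	tar -cjpf - ./ | ssh jackrabbit "cd /directory ; tar -xjpvf - "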
>
> YMMV, but I have been convinced that, apart from specific use cases
> with text-only documents or data known to compress quickly and well,
> compression prior to transfer may waste more time than it saves.
>
> This said, if someone has a parallel hack of gzip or similar we can  
> pipe through, by all means, I would be happy to try it. But it  
> would have to be pretty darned efficient.
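>
> One possibility, assuming pigz (Mark Adler's parallel gzip) is
> installed on the sending side, and untested on my part, would be to
> keep the tar/ssh pipe as-is and just swap in the compressor:
>
> 	cd /directory
> 	tar -cpf - ./ | pigz -p 8 | ssh jackrabbit "cd /directory ; gunzip -c | tar -xpvf - "
>
> pbzip2 could be swapped in the same way (with bunzip2 -c on the far
> side) if the extra CPU time is acceptable.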
>
> 100 MB/s means 1 byte transmitted, on average, every 10 nanoseconds.
> Which means that for compression to be meaningful, you would need to
> compute for less time than that per byte while increasing the
> information density.  Put another way, 1 MB takes about 10 ms to send
> over a gigabit link.  For compression to be meaningful, you need to
> compress this 1 MB in far less than 10 ms and still transmit it in
> that time.  Assuming that any compression algorithm has to walk
> through the data at least once, a 1 GB/s memory subsystem takes about
> 1 ms just for that single pass, so you need as few passes as possible
> through the data set to construct the compressed representation, as
> you will still have on the order of 1E+5 bytes to send.
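>
> To put some assumed numbers on that (the ~25 MB/s single-core gzip
> figure below is a guess, not a measurement):
>
> 	awk 'BEGIN { link=100; gz=25;
> 		printf "send 1 MB: %.0f ms;  gzip 1 MB: %.0f ms;  parallel streams to keep up: %.0f\n",
> 		1000/link, 1000/gz, link/gz }'
>
> i.e. something like four compression streams running flat out just to
> match the wire, before the compression buys you anything.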
>
> I am not saying it is hopeless, just hard for complex compression  
> schemes (bzip2, etc).  When we get enough firepower in the CPU (or  
> maybe GPU ... hmmmm) the situation may improve.
>
> GPU as a compression engine?  Interesting ...
>
> Joe
>
>> Cheers
>> Carsten
>
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>        http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
>



