[Beowulf] Compressor LZO versus others

Sun Aug 19 08:45:29 PDT 2012

On 08/19/2012 09:09 AM, Vincent Diepeveen wrote:
> Here is the results:
>
> The original file used for every compressor. A small EGTB of 1.8GB:
>
> -rw-rw-r--. 1 diep diep 1814155128 Aug 19 10:37 knnknp_w.dtb
>
> LZO (default compression):
>
> -rw-rw-r--. 1 diep diep  474233006 Aug 19 10:37 knnknp_w.dtb.lzo
>
> 7-zip (default compression):
>
> -rw-rw-r--. 1 diep diep 160603822 Aug 18 19:33 ../7z/33p/knnknp_w.dtb.7z
>
> Andrew Kadatch:
>
> -rw-rw-r--. 1 diep diep  334258087 Aug 19 14:37 knnknp_w.dtb.emd
>
> We see kadatch is a 140MB smaller in size than LZO, that's a lot at
> 474MB total size for the lzo
> and it's 10% of total size of the original data.
>
> So LZO in fact is so bad it doesn't even beat another Huffman
> compressor. A fast bucket compressor not using a dictionary at all is
> hammering it.

Thanks for these insightful findings Vincent.  Unless I missed 
something, I didn't see timings for these algorithms.  I would be very 
interested to see these compressions wrapped in a 'time' command and 
please make sure to flush your buffer cache in between.  In Hadoop LZO 
seems to be the defacto standard for its widespread use, speed both of 
compression and decompression, and relatively high compression ratio 
compared to very bare-bones compressors.

So seeing these results, alongside the 1) time to compress when data is 
solely on HDD and 2) time to decompress when data is solely on HDD would 
be really, really helpful.

For Hadoop, since compression is mainly used to "package" data up prior 
to network transfer (and obviously it gets "unpackaged" on the other 
side if it needs to be used), the balance between speed and compression 
is a fine balance, dependent on your network and CPU capabilities.

Please let me know if you get around to running these experiments and if 
you find another compressor out there that is excellent and I'll have to 
consider it for my use in Hadoop!

Best,

ellis