[Beowulf] Accelerator for data compressing

Bill Broadley bill at cse.ucdavis.edu
Fri Oct 3 02:17:52 PDT 2008


Vincent Diepeveen wrote:
> Bzip2, gzip,
> 
> Why do you guys keep quoting those total outdated compressors :)

Path of least resistance, not to mention Python bindings.

> there is 7-zip for linux, it's open source and also part of LZMA. On 
> average remnants
> are 2x smaller than what gzip/bzip2 is doing for you (so bzip2/gzip is 
> factor 2 worse).
> 7-zip also works parallel, not sure whether it works in linux parallel. 
> 7za is command line
> version.

The question seems to be about CPU utilization as well as compression 
ratio.  Assuming the TIFF files are not already compressed, how fast would 
you expect 7-zip to be relative to bzip2 and gzip, for both compression and 
decompression?  I was looking for decent bandwidth, and when I looked around 
a bit I found that the alternatives often compressed somewhat better, but the 
bandwidth achieved was often 5-6x worse.  So for squeezing the most out of 
a 28k modem... sure.  For keeping up with a 100 Mbit or GigE connection on a 
local LAN, not so much.

Google finds:
http://blogs.reucon.com/srt/2008/02/18/compression_gzip_vs_bzip2_vs_7_zip.html

Compressor  Size   Ratio  Compression  Decompression
gzip        89 MB  54 %   0m 13s       0m 05s
bzip2       81 MB  49 %   1m 30s       0m 20s
7-zip       61 MB  37 %   1m 48s       0m 11s

So sure, you save 28MB, at the cost of 95 seconds.  That might make sense if 
you are transferring over a slow modem.  Also, considering the original file 
was 163MB, 7-zip's roughly 1.5MB/sec is nowhere near the 6MB/sec that seems to 
be the target.  At 1.5MB/sec you'd need 4 CPUs running flat out for 2 days to 
manage 2TB, instead of 1 CPU running for just 24 hours.  Definitely the kind 
of thing that sounds like it might make a big difference.
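
To make that arithmetic explicit, here is a rough Python sketch that uses only 
the numbers from the table above.  The helper names (throughput, cores_needed, 
wire_seconds_saved) are made up for illustration.  It assumes a megabyte is 
10^6 bytes, treats "28k" as 28,000 bit/s, and assumes perfect scaling across 
cores, which a real compressor will not quite achieve.

```python
import math

# Figures copied from the table above: (compressed size in MB, compress seconds).
ORIGINAL_MB = 163
RESULTS = {
    "gzip": (89, 13),
    "bzip2": (81, 90),
    "7-zip": (61, 108),
}
TARGET_MB_PER_S = 6  # target rate mentioned in the thread


def throughput(name):
    """Input megabytes consumed per second of compression time."""
    _, seconds = RESULTS[name]
    return ORIGINAL_MB / seconds


def cores_needed(name, target=TARGET_MB_PER_S):
    """Cores required to sustain the target, assuming perfect scaling."""
    return math.ceil(target / throughput(name))


def wire_seconds_saved(link_bits_per_s, small="7-zip", big="gzip"):
    """Transfer time saved by sending the smaller output instead of the bigger."""
    saved_mb = RESULTS[big][0] - RESULTS[small][0]
    return saved_mb * 8e6 / link_bits_per_s


for name in RESULTS:
    rate = throughput(name)
    print(f"{name:6s} {rate:5.1f} MB/s  cores for target: {cores_needed(name)}")

# Extra CPU time 7-zip spends relative to gzip on this file.
extra_cpu = RESULTS["7-zip"][1] - RESULTS["gzip"][1]
for label, bits in [("28k modem", 28e3), ("100 Mbit", 100e6), ("GigE", 1e9)]:
    saved = wire_seconds_saved(bits)
    print(f"{label:10s} saves {saved:8.1f} s on the wire for {extra_cpu} s of CPU")
```

On the modem the smaller file pays for itself many times over; on a LAN the 
time saved on the wire is a small fraction of the extra CPU time.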

Another example:
http://bbs.archlinux.org/viewtopic.php?t=11670

7zip compress: 19:41
Bzip2 compress:  8:56
Gzip compress:  3:00

Again 7zip is a factor of 6 and change slower than gzip.
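
For anyone who wants to check the ratio, a small sketch that converts the 
mm:ss timings above to seconds and compares each to gzip.  The to_seconds 
helper is hypothetical; the timings are copied from the list above.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Compression times copied from the archlinux thread quoted above.
TIMES = {"7zip": "19:41", "bzip2": "8:56", "gzip": "3:00"}

gzip_seconds = to_seconds(TIMES["gzip"])
for name, stamp in TIMES.items():
    seconds = to_seconds(stamp)
    print(f"{name:6s} {seconds:5d} s  {seconds / gzip_seconds:.2f}x gzip")
```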

> Linux distributions should include it default.
> 
> Uses PPM, that's a new form of multidimensional compression that all 
> that old junk like
> bzip2/gzip doesn't use.

One man's junk is another man's gold.  My use was backup related and I 
definitely didn't want to become CPU limited, even on large systems with 10TB 
of disk and a healthy I/O system.  From the sound of it, even with 8 fast 
cores 7zip might easily be the bottleneck.

> TIFF files compress real bad of course. Maybe convert them to some more 
> inefficient format,
> which increases its size probably, which then compresses real great with 
> PPM.

Er, that makes no sense to me.  You aren't going to end up with a smaller file 
by encoding it less efficiently first.  Under ideal circumstances you might 
get back to where you started, after burning a substantial number of cycles.  
It seems pretty simple: if the TIFFs are already compressed, just send them 
as is; significant additional compression is unlikely.  If they are 
uncompressed, there's a decent chance of significant lossless compression.  
The best thing to do would be to try it, or at least find a reference that 
tested some similar images.
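
As a starting point for "try it", here is a minimal sketch using the 
compression modules in a modern Python 3 standard library.  It is only an 
approximation of the command-line tools: the lzma module implements the 
LZMA/xz family rather than 7-zip itself, default compression levels are used 
throughout, and the benchmark and report names are made up for this example.  
It also reads the whole file into memory, so for multi-GB inputs you would 
want to stream in chunks instead.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data):
    """Return {name: (ratio, compress_seconds, decompress_seconds)} for data."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name} round trip did not restore the input")
        # Ratio is compressed size over original size, as in the table above.
        results[name] = (len(packed) / len(data), middle - start, end - middle)
    return results


def report(path):
    """Print ratio and throughput for one file, e.g. a sample TIFF."""
    with open(path, "rb") as handle:
        data = handle.read()
    megabytes = len(data) / 1e6
    for name, (ratio, c_s, d_s) in benchmark(data).items():
        c_rate = megabytes / max(c_s, 1e-9)  # guard against a zero timer delta
        d_rate = megabytes / max(d_s, 1e-9)
        print(f"{name:6s} ratio {ratio:6.1%}  "
              f"compress {c_rate:8.1f} MB/s  decompress {d_rate:8.1f} MB/s")
```

Calling report() on one representative image (the path is whatever sample you 
have on hand) should show quickly whether the TIFFs are already compressed: 
if every ratio comes out near 100 %, just send them as is.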


