[Beowulf] Accelerator for data compressing


Bill Broadley bill at cse.ucdavis.edu
Fri Oct 3 02:17:52 PDT 2008


Vincent Diepeveen wrote:
> Bzip2, gzip,
> 
> Why do you guys keep quoting those total outdated compressors :)

Path of least resistance, not to mention python bindings.

> there is 7-zip for linux, it's open source and also part of LZMA. On average
> remnants are 2x smaller than what gzip/bzip2 is doing for you (so bzip2/gzip
> is factor 2 worse).
> 7-zip also works parallel, not sure whether it works in linux parallel. 7za
> is command line version.

Seems like the question is about CPU utilization as well as compression
ratios.  Assuming the TIFF files are not already compressed, how fast would
you expect 7-zip to be relative to bzip2 and gzip, for both compression and
decompression?  I was looking for decent bandwidth, and when I looked around a
bit it seemed that the alternatives would often compress somewhat better, but
the bandwidth achieved was often 5-6x worse.  So for squeezing the most out of
a 28k modem... sure.  For keeping up with a 100 Mbit or GigE connection on a
local LAN, not so much.
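
A rough way to see the modem vs. LAN point, as a sketch only: assume the compressor and the link run as a pipeline, so the rate at which original data moves is the smaller of the compression rate and the link rate divided by the compression ratio.  The function names are made up for illustration, the link speeds are nominal (no protocol overhead), and the 1.5 MB/s and 37 % figures are the 7-zip numbers from the blog table quoted further down in this message.

```python
def effective_rate(link_mb_s, compress_mb_s, ratio):
    """MB/s of original data moved when compressing and sending are pipelined.

    ratio is compressed size / original size, e.g. 0.37 for 37 %.
    """
    return min(compress_mb_s, link_mb_s / ratio)


def compression_pays_off(link_mb_s, compress_mb_s, ratio):
    # Sending uncompressed moves original bytes at the raw link rate,
    # so compression only helps if the pipelined rate beats that.
    return effective_rate(link_mb_s, compress_mb_s, ratio) > link_mb_s


# Nominal link speeds in MB/s (bits per second / 8 / 10**6).
LINKS = {
    "28.8k modem": 28_800 / 8 / 1e6,
    "100 Mbit": 100e6 / 8 / 1e6,
    "GigE": 1e9 / 8 / 1e6,
}

for name, link in LINKS.items():
    verdict = compression_pays_off(link, compress_mb_s=1.5, ratio=0.37)
    print(f"{name}: compress first? {verdict}")
```

This ignores decompression on the far end and assumes one compressor core; more cores raise compress_mb_s if the compressor parallelizes well.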

Google finds:
http://blogs.reucon.com/srt/2008/02/18/compression_gzip_vs_bzip2_vs_7_zip.html

Compressor  Size   Ratio  Compression time  Decompression time
gzip        89 MB  54 %   0m 13s            0m 05s
bzip2       81 MB  49 %   1m 30s            0m 20s
7-zip       61 MB  37 %   1m 48s            0m 11s

So sure you save 28MB over gzip, at the cost of 95 seconds.  Might make sense
if you are transferring over a slow modem.  Also, considering the original file
was 163MB, that's about 1.5MB/sec, nowhere near the 6MB/sec that seems to be
the target.  At 1.5MB/sec you'd need 4 CPUs running flat out for almost 4 days
to manage 2TB, instead of 1 CPU running gzip (about 12.5MB/sec) for just under
2 days.  Definitely the kind of thing that sounds like it might make a big
difference.
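
For anyone who wants to redo that arithmetic, here is a back-of-the-envelope sketch using only the table's numbers.  The assumptions are mine: 2 TB is taken as 2,000,000 MB, throughput measured on one 163 MB file is assumed to hold for the whole data set, and multiple CPUs are assumed to scale perfectly.

```python
ORIGINAL_MB = 163  # size of the test file in the quoted benchmark

# Compression times from the table above, in seconds.
COMPRESS_SECONDS = {"gzip": 13, "bzip2": 90, "7-zip": 108}

TWO_TB_MB = 2_000_000  # assumption: 2 TB = 2 * 10**6 MB


def throughput_mb_s(seconds, size_mb=ORIGINAL_MB):
    """Compression throughput in MB/s of original data."""
    return size_mb / seconds


def days_for(total_mb, rate_mb_s, cpus=1):
    """Days to compress total_mb, assuming perfect scaling across cpus."""
    return total_mb / (rate_mb_s * cpus) / 86_400


for name, secs in COMPRESS_SECONDS.items():
    rate = throughput_mb_s(secs)
    print(f"{name}: {rate:.1f} MB/s, "
          f"{days_for(TWO_TB_MB, rate):.1f} CPU-days for 2 TB")
```

Calling days_for(TWO_TB_MB, throughput_mb_s(108), cpus=4) gives the 4-CPU 7-zip case.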

Another example:
http://bbs.archlinux.org/viewtopic.php?t=11670

7zip compress: 19:41
Bzip2 compress:  8:56
Gzip compress:  3:00

Again 7zip is a factor of 6 and change slower than gzip.
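
The factor is easy to check, assuming those times are minutes:seconds (the forum post's units are my assumption):

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '19:41' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Compression times quoted from the Arch Linux forum thread.
TIMES = {"7zip": "19:41", "bzip2": "8:56", "gzip": "3:00"}

baseline = to_seconds(TIMES["gzip"])
for name, t in TIMES.items():
    print(f"{name}: {to_seconds(t) / baseline:.2f}x gzip's time")
```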

> Linux distributions should include it default.
> 
> Uses PPM, that's a new form of multidimensional compression that all that
> old junk like bzip2/gzip doesn't use.

One man's junk is another man's gold.  My use was backup related and I
definitely didn't want to become CPU limited, even on large systems with 10TB
of disk and a healthy I/O system.  From the sound of it, even with 8 fast
cores 7zip might easily be the bottleneck.

> TIFF files compress real bad of course. Maybe convert them to some more
> inefficient format, which increases its size probably, which then compresses
> real great with PPM.

Er, that makes no sense to me.  You aren't going to end up with a smaller file
by encoding a file less efficiently.  Under ideal circumstances you might get
back to where you started, after a substantial use of cycles.  Seems pretty
simple: if the TIFFs are compressed, just send them as is; significant
additional compression is unlikely.  If they are uncompressed, there's a decent
chance of significant lossless compression.  The best thing to do would be to
try it, or at least find a reference on some similar images.
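
Trying it is cheap with the Python bindings.  A minimal sketch, assuming a sample file small enough to read into memory: it runs the standard library's zlib (the deflate algorithm gzip uses), bz2, and lzma (the same LZMA family as 7-zip, not 7-zip itself) at their default settings and reports ratio and throughput.  The benchmark and report names are mine, and timings on one machine and one file are only indicative.

```python
import bz2
import lzma
import time
import zlib

# name -> (compress, decompress), all from the Python standard library.
CODECS = {
    "zlib (deflate)": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Return {codec: (ratio, compress MB/s, decompress MB/s)}.

    data must be a non-empty bytes object.
    """
    size_mb = len(data) / 1e6
    results = {}
    for name, (compress, decompress) in CODECS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        unpacked = decompress(packed)
        t2 = time.perf_counter()
        assert unpacked == data  # lossless round trip
        results[name] = (
            len(packed) / len(data),
            size_mb / max(t1 - t0, 1e-9),
            size_mb / max(t2 - t1, 1e-9),
        )
    return results


def report(path):
    """Print one line per codec for the file at path."""
    with open(path, "rb") as f:
        data = f.read()
    for name, (ratio, c_rate, d_rate) in benchmark(data).items():
        print(f"{name:16s} ratio {ratio:6.1%}  "
              f"compress {c_rate:8.1f} MB/s  decompress {d_rate:8.1f} MB/s")
```

Something like report("sample.tif"), with a hypothetical path to one of the real images, shows quickly whether the TIFFs are already compressed (ratio near 100 %) and which codec can keep up with the link.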


