[Beowulf] Accelerator for data compressing

Vincent Diepeveen diep at xs4all.nl
Fri Oct 3 03:29:16 PDT 2008

hi Bill,

7-zip decompression is of course faster than bzip2 and a bit slower than gzip.
Thing is that after the files have been compressed you need to do
less i/o;
the big bottleneck in most systems is the bandwidth to and from i/o.

A single disk at the start of the 90s did several megabytes per second,
up to maybe 10 MB/s. The speed I see SATA2 drives get now is roughly 133 MB/s.

I remember a talk by the NCSA director a few weeks ago where he
showed a graph,
and it's quite obvious that we are going to have plenty of CPU power
in the future. The question is what to do with it.

Let's be realistic. Suppose you've got a bunch of cores, soon hundreds.
Then for each byte you get off disk, you have hundreds of CPU cycles
available to decompress it.
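A back-of-envelope check of that claim (the clock speed and core count here are assumed numbers, just for illustration):

```python
# Cycles available per byte coming off disk.
# Clock speed and core count are assumptions, not measurements.
disk_bw = 133e6          # bytes/s, one SATA2 drive as mentioned above
clock = 3e9              # cycles/s per core (assumed)
cores = 128              # "soon hundreds" of cores

per_core = clock / disk_bw      # cycles per byte, per core
total = cores * per_core        # cycles per byte across all cores
print(f"{per_core:.0f} cycles/byte per core, {total:.0f} in total")
```

So even a single core has a couple of dozen cycles to spend on each byte before it falls behind the disk, and a many-core machine has thousands.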

That bandwidth to and from i/o you can reduce bigtime with compression;
CPU power there is enough, provided it's not taking too long to decompress.

Decompression speed of 7-zip is faster than bzip2 in my experience,
because it can decompress in parallel, and I don't see bzip2 ever
getting parallelized.
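Parallel decompression only works when the archive is cut into independently compressed blocks. A minimal sketch of the idea using zlib (not the actual LZMA format 7-zip uses, just an illustration):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data, block_size=1 << 20):
    """Compress each block independently, so each can be decompressed alone."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_parallel(blocks, workers=4):
    # zlib releases the GIL while (de)compressing, so the threads really
    # do run in parallel; each block carries its own stream state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

The price of independent blocks is a slightly worse ratio, since the dictionary resets at every block boundary.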

Compression is one of those areas where little money gets put into
making really good realtime compression. In fact I'm busy building a
kind of filesystem myself now where compression is allowed to take
long but decompression must be realtime (simply, at a given CPU speed,
a guaranteed bandwidth to decompress) for any bucket.

The best compressors today have just one basic problem for GIS-type
and, in my case, chess database systems: they decompress the entire file.

With files of gigabytes, if you just want to access random byte
number X in the file, you're not interested in bandwidth, nor in
whether gzip is faster or slower than PPM-type compressors at
decompression.

You just want that single byte that might be at the end of the file.
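The bucket idea above makes that kind of random access cheap: compress in fixed-size chunks and keep an index of where each chunk starts, so fetching one byte means decompressing one chunk. A sketch, again with zlib standing in for a real compressor:

```python
import zlib

CHUNK = 64 * 1024  # uncompressed bytes per independently compressed chunk

def build_archive(data):
    """Compress in fixed-size chunks; keep an offset index for random access."""
    blob, index = bytearray(), []
    for i in range(0, len(data), CHUNK):
        index.append(len(blob))           # where this chunk starts in the blob
        blob += zlib.compress(data[i:i + CHUNK])
    index.append(len(blob))               # sentinel: end of the last chunk
    return bytes(blob), index

def read_byte(blob, index, pos):
    """Fetch byte `pos` by decompressing one chunk, not the whole file."""
    c = pos // CHUNK
    chunk = zlib.decompress(blob[index[c]:index[c + 1]])
    return chunk[pos % CHUNK]
```

Reading the last byte of a multi-gigabyte file then costs one 64 KB decompress instead of decompressing gigabytes.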

I'm planning to build such a compressor, one which does a lot of
mathematical function comparisons to get PPM-type compression, while
taking care that decompression can be done in realtime.

There are a few filesystems which allow this, but the compression
they use for it is just 70s/80s Huffman-based.
Really ugly and bad, to put it politely.

Note I'm speaking of lossless compression here; lossy compression is
yet another subject.
On the web, if you ask me, too many photos are of a quality that is
just ugly. Lots to improve there as well.
With lossy you can of course do way more than with lossless.

Yet for lossless I don't see really great systems out there yet.

Time to build one myself :)

My argument is that something like a GPU might be really interesting
to look at for compression.
You can dedicate a bunch of cheap GPUs to just compression; for
databases, what really matters is the decompression time.

They store so much that I'd argue good compression, both lossless and
lossy, is what we need.
Nowadays governments want to put every phone conversation in a
database, no matter how hard they're going to deny that to you.

I'd foresee that the amount of storage needed is huge; a tiny company
I worked for was putting data away at the petabyte level, and the
majority will never get decompressed, other than by some AI-type
software programs that do some scanning later on.

The real big difference between the 'testsets' that get used in
compression benchmarks and the state of compression in practice is
that there is no money to be made for the guys writing a superior
compressor that works on terabytes. Companies do not readily invest
in such technologies; they just invest in products they can sell.

Having also worked with GIS data, I see possibilities to compress it
factors better and still look it up in realtime from embedded CPUs.


On Oct 3, 2008, at 11:17 AM, Bill Broadley wrote:

> Vincent Diepeveen wrote:
>> Bzip2, gzip,
>> Why do you guys keep quoting those total outdated compressors :)
> Path of least resistance, not to mention python bindings.
>> there is 7-zip for linux, it's open source and also part of LZMA.  
>> On average remnants
>> are 2x smaller than what gzip/bzip2 is doing for you (so bzip2/ 
>> gzip is factor 2 worse).
>> 7-zip also works parallel, not sure whether it works in linux  
>> parallel. 7za is command line
>> version.
> Seems like the question is related to CPU utilization as well as  
> compression ratios.  Assuming the TIFF files are not already  
> compressed, how fast would you expect 7-zip to be relative to bzip2  
> and gzip's compression and decompression speeds?  I was looking for  
> decent bandwidth, and I did look around a bit and it seemed like  
> things often would compress somewhat better, often the bandwidth  
> achieved was 5-6x worse.  So for squeezing the most out of a 28k  
> modem... sure.  For keeping up with a 100mbit or GigE connection on  
> a local LAN, not so much.
> Google finds:
> http://blogs.reucon.com/srt/2008/02/18/compression_gzip_vs_bzip2_vs_7_zip.html
> Compressor   Size    Ratio   Compression   Decompression
> gzip         89 MB   54 %    0m 13s        0m 05s
> bzip2        81 MB   49 %    1m 30s        0m 20s
> 7-zip        61 MB   37 %    1m 48s        0m 11s
> So sure you save 28MB, at the cost of 95 seconds.  Might make sense  
> if you are transferring over a slow modem.  Also considering the
> original file was 163MB it's nowhere near the 6MB/sec that seems to  
> be the target.  At 1.5MB/sec you'd need 4 CPUs running flat out for  
> 2 days to manage 2TB, instead of 1 CPU running for just 24 hours.   
> Definitely the kind of thing that sounds like it might make a big  
> difference.
> Another example:
> http://bbs.archlinux.org/viewtopic.php?t=11670
> 7zip compress: 19:41
> Bzip2 compress:  8:56
> Gzip compress:  3:00
> Again 7zip is a factor of 6 and change slower than gzip.
>> Linux distributions should include it default.
>> Uses PPM, that's a new form of multidimensional compression that  
>> all that old junk like
>> bzip2/gzip doesn't use.
> One man's junk and another man's gold.  My use was backup related  
> and I definitely didn't want to become CPU limited even on large  
> systems with 10TB of disk and a healthy I/O system.  From the  
> sounds of it even with 8 fast cores that 7zip might easily be the  
> bottleneck.
>> TIFF files compress real bad of course. Maybe convert them to some  
>> more inefficient format,
>> which increases its size probably, which then compresses real  
>> great with PPM.
> Er, that makes no sense to me.  You aren't going to end up with a  
> smaller file by encoding a file less efficiently.. under ideal  
> circumstances you might get back to where you started with a  
> substantial use of cycles.  Seems pretty simple, if the TIFFs are  
> compressed, just send them as is, significant additional  
> compression is unlikely.  If they are uncompressed there's a decent  
> chance of significant lossless compression, the best thing to do  
> would be to try it or at least a reference to some similar images.

More information about the Beowulf mailing list