[Beowulf] Compressor LZO versus others

Sun Aug 19 10:46:57 PDT 2012

On Aug 19, 2012, at 5:45 PM, Ellis H. Wilson III wrote:

> On 08/19/2012 09:09 AM, Vincent Diepeveen wrote:
>> Here are the results:
>>
>> The original file used for every compressor. A small EGTB of 1.8GB:
>>
>> -rw-rw-r--. 1 diep diep 1814155128 Aug 19 10:37 knnknp_w.dtb
>>
>> LZO (default compression):
>>
>> -rw-rw-r--. 1 diep diep  474233006 Aug 19 10:37 knnknp_w.dtb.lzo
>>
>> 7-zip (default compression):
>>
>> -rw-rw-r--. 1 diep diep  160603822 Aug 18 19:33 ../7z/33p/knnknp_w.dtb.7z
>>
>> Andrew Kadatch:
>>
>> -rw-rw-r--. 1 diep diep  334258087 Aug 19 14:37 knnknp_w.dtb.emd
>>
>> We see Kadatch is about 140MB smaller than LZO. That's a lot against
>> the 474MB total size of the LZO output,
>> and it's roughly 8% of the total size of the original data.
>>
>> So LZO in fact is so bad it doesn't even beat another Huffman
>> compressor. A fast bucket compressor not using a dictionary at all is
>> hammering it.
>
> Thanks for these insightful findings Vincent.  Unless I missed
> something, I didn't see timings for these algorithms.  I would be very
> interested to see these compressions wrapped in a 'time' command and
> please make sure to flush your buffer cache in between.  In Hadoop LZO
> seems to be the de facto standard for its widespread use, speed both of
> compression and decompression, and relatively high compression ratio
> compared to very bare-bones compressors.

I understand your question about 'time'.

Now I'll do something that seems terribly arrogant, but really I'm no
expert on compression. There are a few guys, especially 1 Dutch guy, who
has a website DEDICATED to measuring everything painfully accurately.

It's better for me to email him a few EGTBs, like I did 10 years or so
ago, and have him toy with them. Then you really get any number you want,
and not for just 1 compressor.

He'll try a whole range maybe for you.

As for me: I want a BIG space reduction, and I compress just 1 time.
Decompressing must be fast, however.

The compression ratio of LZO is just too laughable to even take
seriously in this case.

It's simply using 1970s algorithms and applies them very badly. It's
fast, sure, but we have plenty of cores everywhere to compress and
especially decompress; and I haven't even mentioned GPUs yet.

It's 3 times worse than 7-zip, which is a relatively fast compressor to
decompress. On a single core 7-zip can write as fast as the full write
bandwidth of 1 drive here; it's reaching the maximum of 1 drive, and
that's just simple C code. For Windows it seems to also have assembler.
Not sure whether the Linux build uses that; I doubt it.

Now I don't know about you, but I have more cores than drives, as
drives are painfully expensive.

So I'm not interested in "how much faster does it decompress". I simply
want a decompression speed where 1 core can get close to the bandwidth of
1 drive, which, if I look at the guarantees of the manufacturers, is
60-80 MB/s sustained, with peaks up to 133MB/s. I decompressed a terabyte
over the last few hours here with 7-zip and it's achieving 50MB/s+ per
core or so.

Yet I do have far more cores available than bandwidth to the drives.
It's easy nowadays to scale cores, in fact.
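
As a rough sketch of that trade-off, the snippet below estimates how many
decompression cores it takes to keep 1 drive busy. It only uses the
approximate figures quoted in this message (50 MB/s per core for 7-zip,
60-80 MB/s sustained and 133 MB/s peak per drive). The function name is
made up for illustration, and it assumes decompression scales linearly
across cores, which real workloads may not.

```python
import math

def cores_to_saturate(drive_mb_s: float, per_core_mb_s: float) -> int:
    """Smallest number of cores whose combined rate reaches the drive rate.

    Assumes perfectly linear scaling across cores (an idealisation).
    """
    return math.ceil(drive_mb_s / per_core_mb_s)

# Rough per-core 7-zip decompression rate quoted in the message (MB/s).
PER_CORE_7ZIP = 50.0

# Drive bandwidth figures quoted in the message (MB/s).
DRIVE_RATES = [("sustained, low", 60.0), ("sustained, high", 80.0), ("peak", 133.0)]

for label, drive in DRIVE_RATES:
    cores = cores_to_saturate(drive, PER_CORE_7ZIP)
    print(f"{label:>15}: {drive:5.0f} MB/s needs {cores} core(s)")
```

Under those assumptions a handful of cores per drive is enough, which is
the point about cores being cheaper to scale than drives.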

LZO isn't achieving a much higher speed there, obviously: I tested on
1 drive, so there simply isn't an I/O speed faster than 100MB/s. So LZO
doesn't offer an advantage there. It's simply 3x worse, and when storing
the massive data that I store, you really want a good space reduction.

For me the space reduction really is *critically* important.

The 7 men are 82TB uncompressed. If I look at an 'attempt' at the 8 men,
that's several petabytes. So if I want to store the final result on an
array that's just some dozens of terabytes, then I need a good compressor.

LZO costs me a factor 3 more in disk space - that's just not acceptable
here.
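
The arithmetic behind the "factor 3" can be sketched from the file sizes
in the listing quoted at the top. The projection to the 82TB 7-men set
assumes that this single EGTB is representative of the whole set, which
is only an assumption made here for illustration; the variable and
function names are likewise made up.

```python
# File sizes in bytes, copied from the directory listing quoted above.
ORIGINAL = 1_814_155_128
SIZES = {
    "LZO": 474_233_006,
    "Kadatch": 334_258_087,
    "7-zip": 160_603_822,
}

# Uncompressed size of the 7-men set, as stated in the message (TB).
SEVEN_MEN_TB = 82.0

def fraction_of_original(compressed: int) -> float:
    """Compressed size as a fraction of the original file."""
    return compressed / ORIGINAL

def projected_tb(compressed: int) -> float:
    """Projected 7-men size in TB, ASSUMING this file is representative."""
    return SEVEN_MEN_TB * fraction_of_original(compressed)

for name, size in SIZES.items():
    print(f"{name:>8}: {fraction_of_original(size):6.1%} of original, "
          f"~{projected_tb(size):5.1f} TB projected for the 7 men")

lzo_vs_7zip = SIZES["LZO"] / SIZES["7-zip"]
print(f"LZO output is {lzo_vs_7zip:.2f}x the size of the 7-zip output")
```

On this one file the LZO output is just under 3 times the size of the
7-zip output, so any projection like this inherits that same factor.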

Data that has finished calculating I basically never need to modify
again. I guess that's true for most.
It does need relatively fast decompression though.

In fact, if there were a compressor that could get me much better
compression than this yet took 10 times the time, and decompressed
"only" 2 times slower than 7-zip, I would use it.

What I do know from experiments 10 years ago is that the best compressor
back then managed a result around a factor 2.5 smaller than 7-zip. Those
experiments were on much smaller EGTBs, but of course it took a single
CPU back then (if I remember well, a K7 at 1.x GHz) around 24 hours to
compress a test set of 200MB or so, and more importantly the same time
to decompress it, plus 400MB of RAM or so.

That was a guy from New Zealand with some crappy DOS-type interface,
if I remember well...

Note that all of this compresses much better for me than PAQ, which was
topping the lists back then but is of course a typical 'test set'
product: using algorithm X for data d1 and algorithm Y for data d2, just
carefully parameter-tuned. A useless compressor to me when I try to get
it to work.

In all this you really want to leave the 80s behind in compression
technology now.

Why isn't there a good open source port of 7-zip that works great on
Linux, by the way?

Especially one where you don't need more RAM than the size of the file
it's decompressing...

It's a magnificent compressor and I don't know anything that rivals it
in terms of speed and the great result it achieves, given the slow I/O
speed you get on average per core.

So my hunt for a good compressor under Linux isn't over yet...

Kind Regards,
Vincent

>
> So seeing these results, alongside the 1) time to compress when data is
> solely on HDD and 2) time to decompress when data is solely on HDD would
> be really, really helpful.

Yeah, well, decompressing a terabyte with just 4GB of RAM here, I bet it
all wasn't on the HDD yet.
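
For anyone who wants to run the timings asked for above, here is a
minimal sketch of a harness. It is not what was used for the numbers in
this thread. The helper names are made up, the command passed in is only
a placeholder to be replaced by the real compressor invocation, and
dropping the page cache through /proc/sys/vm/drop_caches is Linux-only
and needs root.

```python
import os
import subprocess
import sys
import time

def drop_page_cache() -> None:
    """Flush dirty pages, then ask Linux to drop its caches (needs root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def timed_run(cmd, data_bytes, drop_cache=False):
    """Run cmd and return (elapsed seconds, throughput in MB/s).

    data_bytes is the amount of data the command processes; it is used
    only to turn the wall-clock time into a throughput figure.
    """
    if drop_cache:
        drop_page_cache()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, data_bytes / elapsed / 1e6

# Placeholder command that does nothing; substitute the real
# compress or decompress command line and the real file size.
placeholder_cmd = [sys.executable, "-c", "pass"]
elapsed, mb_s = timed_run(placeholder_cmd, data_bytes=1_814_155_128)
print(f"{elapsed:.3f} s, {mb_s:.1f} MB/s (placeholder command, not a real result)")
```

Passing drop_cache=True before each run is what makes the measurement
reflect data that is solely on the HDD rather than in the buffer cache.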

>
> For Hadoop, since compression is mainly used to "package" data up prior
> to network transfer (and obviously it gets "unpackaged" on the other
> side if it needs to be used), the balance between speed and compression
> is a fine balance, dependent on your network and CPU capabilities.
>
> Please let me know if you get around to running these experiments and if
> you find another compressor out there that is excellent and I'll have to
> consider it for my use in Hadoop!
>
> Best,
>
> ellis