[Beowulf] Accelerator for data compressing

Vincent Diepeveen diep at xs4all.nl
Fri Oct 3 01:13:16 PDT 2008


Bzip2, gzip,

Why do you guys keep quoting those totally outdated compressors? :)

There is 7-Zip for Linux (p7zip); it's open source and based on LZMA. On
average the resulting archives are about 2x smaller than what gzip/bzip2
gives you (so bzip2/gzip is a factor of 2 worse). 7-Zip can also compress
in parallel, though I'm not sure whether the multithreading works on
Linux. 7za is the command-line version.

Linux distributions should include it by default.
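
For what it's worth, a minimal way to drive 7za from a script; the
archive and directory names here are just placeholders, and the exact
switches are worth checking against 7za's own help:

    import subprocess

    # Compress a directory into a .7z archive with 7za (assumed on PATH).
    #   a       : add to archive
    #   -t7z    : 7z container (LZMA by default)
    #   -mx=9   : maximum compression level
    #   -mmt=on : use multiple threads where the build supports it
    subprocess.check_call([
        "7za", "a", "-t7z", "-mx=9", "-mmt=on",
        "output.7z", "data_directory/",
    ])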

Besides its default LZMA (a large-dictionary LZ method), it also offers
PPMd, a context-modelling compressor that old junk like bzip2/gzip
simply doesn't have.

TIFF files compress really badly, of course, when they are already
internally compressed. One option is to convert them to an uncompressed
representation first; that grows the files, but the raw data then
compresses much better with an LZMA/PPMd-class compressor.
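
A rough sketch of that idea, assuming Pillow and numpy are available
(the file names are placeholders, and Python's lzma module stands in
here for 7-Zip/PPMd):

    import lzma
    import numpy as np
    from PIL import Image

    # Decode the (possibly internally compressed) TIFF to raw pixels...
    pixels = np.asarray(Image.open("scan_0001.tif"))

    # ...then compress the raw bytes with an LZMA-class compressor.
    # The raw array is larger than the original TIFF, but the compressor
    # can now see the pixel redundancy directly.
    with open("scan_0001.raw.xz", "wb") as out:
        out.write(lzma.compress(pixels.tobytes(), preset=9))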

When googling for the best compressors, don't bother with PAQ; it's a
benchmark-oriented compressor. It did worse on my terabyte of data than
even 7-Zip (which is far from the best PPM-style compressor, but it is
open source).

Vincent

On Oct 3, 2008, at 3:11 AM, Bill Broadley wrote:

> Xu, Jerry wrote:
>> Hello,  Currently I generate nearly one TB of data every few days and
>> I need to pass it along the enterprise network to the storage center
>> attached to my HPC system. I am thinking about compressing it (mostly
>> TIFF-format image data)
>
> tiff uncompressed, or tiff compressed files?  If uncompressed I'd  
> guess that
> bzip2 might do well with them.
>
>> as much as I can, as fast as I can, before I send it across the
>> network ... So, I am wondering whether anyone is familiar with any
>> hardware-based accelerator which can dramatically improve the
>> compression process..
>
> Improve?  You mean compression ratio?  Wall clock time?  CPU  
> utilization?
> Adding forward error correction?
>
>> suggestion for any file system architecture
>> will be appreciated too..
>
> Er, hard to imagine a reasonable recommendation without much more  
> information.
> Organization, databases (if needed), filenames and related metadata  
> are rather
> specific to the circumstances.  Access patterns, retention time,  
> backups, and many other issues would need consideration.
>
>> I have a couple of contacts from some vendors but am not sure whether
>> it works as I expect, so if anyone has experience with it and wants to
>> share, it will be really appreciated!
>
> Why hardware?  I have some python code that managed 10MB/sec per CPU
> (or 80MB/sec on 8 CPUs if you prefer) that compresses with zlib,
> hashes with sha256, and encrypts with AES (256 bit key).  Assuming the
> compression you want isn't substantially harder than doing zlib,
> sha256, and AES, a single core from a dual- or quad-core chip sold in
> the last few years should do fine.
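
Something like that pipeline, stripped down (this is not Bill's actual
code; it assumes PyCrypto for the AES part, and the key/IV here are
throwaway values just for illustration):

    import hashlib
    import os
    import zlib
    from Crypto.Cipher import AES  # PyCrypto/PyCryptodome, assumed installed

    def pack_chunk(chunk, key):
        """Compress, hash, and AES-256-CBC-encrypt one chunk of data."""
        compressed = zlib.compress(chunk, 6)
        digest = hashlib.sha256(compressed).hexdigest()

        # Pad to the 16-byte AES block size before encrypting.
        pad = 16 - len(compressed) % 16
        padded = compressed + bytes([pad]) * pad

        iv = os.urandom(16)
        ciphertext = AES.new(key, AES.MODE_CBC, iv).encrypt(padded)
        return iv + ciphertext, digest

    key = os.urandom(32)  # 256-bit key, throwaway for the example
    with open("input.dat", "rb") as f:
        blob, digest = pack_chunk(f.read(), key)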
>
> 1TB every 2 days = 6MB/sec or approximately 15% of a quad core or  
> 60% of a
> single core for my compress, hash and encrypt in python.   
> Considering how
> cheap cores are (quad desktops are often under $1k) I'm not sure  
> what would
> justify an accelerator card.  Not to mention picking the particular  
> algorithm
> could make a huge difference to the CPU and compression ratio  
> achieved.  I'd
> recommend taking a stack of real data and trying out different  
> compression
> tools and settings.
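
One way to run that comparison, sketched with Python's stdlib codecs
standing in for the real tools (the sample file name is a placeholder;
use a chunk that is big enough to give meaningful timings):

    import bz2
    import lzma
    import time
    import zlib

    def measure(name, compress, data):
        start = time.time()
        out = compress(data)
        elapsed = max(time.time() - start, 1e-9)
        print("%-5s ratio %.2f  %.1f MB/s" % (
            name, len(data) / len(out), len(data) / elapsed / 1e6))

    with open("sample_from_real_data.bin", "rb") as f:
        data = f.read()

    measure("zlib", lambda d: zlib.compress(d, 6), data)
    measure("bz2",  lambda d: bz2.compress(d, 9), data)
    measure("lzma", lambda d: lzma.compress(d, preset=6), data)

    # Required sustained rate: 1 TB every 2 days
    # ~= 1e12 / (2 * 86400) bytes/s ~= 5.8 MB/s.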
>
> In any case 6MB/sec of compression isn't particularly hard these
> days... even in python on a 1-2 year old mid-range cpu.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf
>



