[Beowulf] Rsync - checksums

Michael Di Domenico mdidomenico4 at gmail.com
Mon Jun 17 07:39:24 PDT 2019


rsync on 10 PB sounds painful.  i haven't used GPFS in a very long
time, so i might have a gap in knowledge, but i would be surprised if
GPFS doesn't have a changelog where you can watch which files changed
through the day and copy only those, much like what robinhood does
for lustre.
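
To make the "only copy the ones that changed" idea concrete, here is a
minimal sketch in Python.  It assumes some changelog source (not shown,
and entirely hypothetical here) has already produced a list of changed
absolute paths; the helper names and the example paths are made up for
illustration.  The only rsync feature relied on is the --files-from
option, which reads paths relative to the source directory.

```python
import tempfile
from pathlib import PurePosixPath


def write_changed_list(changed_paths, src_root):
    """Write changed paths, made relative to src_root, one per line.

    changed_paths is assumed to come from a filesystem changelog;
    how that list is produced is outside this sketch.
    """
    root = PurePosixPath(src_root)
    tmp = tempfile.NamedTemporaryFile("w", suffix=".list", delete=False)
    with tmp:
        for path in changed_paths:
            tmp.write(str(PurePosixPath(path).relative_to(root)) + "\n")
    return tmp.name


def build_rsync_command(list_file, src_root, dest):
    """Return an rsync argv that transfers only the listed paths."""
    # With --files-from, -a does not imply recursion, which is fine
    # here because the list names individual files.
    return [
        "rsync",
        "-a",
        f"--files-from={list_file}",
        src_root.rstrip("/") + "/",
        dest,
    ]
```

The returned list can be handed to subprocess.run() once the
destination is known.  rsync then never walks the unchanged part of
the tree, so it has far fewer files to compare.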

On Mon, Jun 17, 2019 at 9:44 AM Bill Wichser <bill at princeton.edu> wrote:
>
> We have moved from TSM tape to an rsync disk backup system in order to
> have a DR copy of our 10 PB GPFS filesystem.  We looked at a lot of options
> but here we are.
>
> MD5 checksums take a lot of compute time with huge files and even with
> millions of smaller ones.  The bulk of the time for running rsync is
> spent computing the source and destination checksums, and we'd like to
> avoid paying the cost of a cryptographic algorithm for that.
>
> Googling around, I found no mention of swapping in a faster hash to
> improve rsync performance.  I did find references to a few hashing
> algorithms, though, which could certainly work here (xxhash, murmurhash,
> sbox, cityhash64).
>
> Rsync has certainly been around for a few years!  We are going to pursue
> changing the current checksum algorithm and using something much faster.
> If anyone has done this already and would like to share their
> experiences, that would be wonderful.  Ideally this could be some optional
> plugin for rsync where users could choose which checksummer to use.
>
> Bill
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
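
The checksum cost described in the quoted message, and the idea of
letting users pick the checksummer, can be sketched roughly as
follows.  This is not rsync code.  It is a small Python timing harness
with a name-to-function table, and it uses zlib.crc32 purely as a
stand-in for a fast non-cryptographic hash, because xxhash and the
others named above are not in the Python standard library.  The
function and table names are made up for illustration.

```python
import hashlib
import time
import zlib


def md5_digest(data: bytes) -> str:
    """Cryptographic hash, as used for the checksums discussed above."""
    return hashlib.md5(data).hexdigest()


def crc32_digest(data: bytes) -> str:
    """Stand-in for a fast non-cryptographic hash (assumption)."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


# Hypothetical "plugin" table: pick a checksummer by name.
CHECKSUMMERS = {"md5": md5_digest, "crc32": crc32_digest}


def time_checksum(name: str, data: bytes, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds to hash data once."""
    func = CHECKSUMMERS[name]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    payload = bytes(64 * 1024 * 1024)  # 64 MiB of zeros
    for name in CHECKSUMMERS:
        seconds = time_checksum(name, payload)
        print(f"{name}: {len(payload) / seconds / 1e6:.0f} MB/s")
```

Throughput depends heavily on the CPU and the library build, so the
numbers are only meaningful relative to each other on the same host.
A 32-bit CRC is also far weaker against collisions than MD5, so this
shows the shape of the comparison rather than a recommended choice.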

