[Beowulf] Rsync - checksums

Ellis H. Wilson III ellis at ellisv3.com
Mon Jun 17 08:29:16 PDT 2019

On 6/17/19 11:12 AM, bill at princeton.edu wrote:
> It's not a GPFS issue per se.  The changelog isn't quite there right now 
> but will be.  Today the question only is about rsync performance.

Hi Bill,

md5 is a reasonably efficient checksum.  Things only start to get a 
little more expensive once you move to SHA-1 and SHA-256.  I would be 
really surprised if you were CPU-bound rather than I/O-bound for this.
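
One rough way to check the CPU side of that claim is to time md5 over 
data that is already in memory and compare the result with the read 
bandwidth you see from the filesystem.  The sketch below is only an 
illustration: the function name and buffer sizes are made up for this 
example, and it measures a single core with Python's hashlib rather than 
rsync's own md5 code.

```python
import hashlib
import time


def md5_throughput_mb_s(total_mb=64, chunk_mb=1):
    """Hash total_mb of in-memory data and return the rate in MB/s.

    No disk I/O is involved, so this approximates the CPU-only cost of md5.
    """
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    hasher = hashlib.md5()
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        hasher.update(chunk)
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    print(f"md5: {md5_throughput_mb_s():.0f} MB/s on one core")
```

If the number printed is well above what the storage delivers to a 
single reader, the hash is unlikely to be the bottleneck.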

By default rsync does not operate using checksums, and therefore does 
not need to read in each file in its entirety to see if it should be 
updated.  Do you have a strong reason for using the --checksum option? 
Users typically have to try pretty hard to circumvent the size-and-mtime 
heuristic rsync uses by default.
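
To make the difference concrete, here is a minimal sketch of the two 
comparison strategies.  It is not rsync's code: the function names are 
hypothetical, and it only mirrors the documented behavior that the 
default "quick check" compares size and modification time, while 
--checksum compares a checksum for files whose sizes match.

```python
import hashlib
import os


def quick_check_differs(src, dst):
    """Metadata-only comparison (size and whole-second mtime); reads no file data."""
    try:
        s, d = os.stat(src), os.stat(dst)
    except FileNotFoundError:
        return True
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def checksum_differs(src, dst, chunk_size=1 << 20):
    """Content comparison; reads both files in full when their sizes match."""
    if not os.path.exists(dst):
        return True
    if os.path.getsize(src) != os.path.getsize(dst):
        return True  # differing sizes settle it without reading any data

    def digest(path):
        hasher = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                hasher.update(block)
        return hasher.digest()

    return digest(src) != digest(dst)
```

The first function costs one stat per file; the second costs a full read 
of every same-sized file on both sides, which is where the I/O goes.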

If you need guaranteed DR even in the face of a file that falls outside 
of the typical rsync heuristic, you're best served by leveraging some 
part of the underlying filesystem's feature set to achieve this.  The 
filesystem is the only layer that can trivially compute and track what 
changed.  In my day job at Panasas we designed pan_snap_delta 
explicitly for this -- to be able to efficiently emit a succinct list of 
files and directories which have changed in any way between snapshots, 
and our customers have used that paired with the rsync --files-from 
option to great effect.  Two summers back I added another utility to 
that mix, pan_snap_replicator, which could figure out exactly how a file 
or folder had changed, which ends up being crucial for situations like 
"we moved our 1PB directory of stuff from /a to /b."  rsync will 
typically cope with this by deleting the entire dir on the remote side 
and re-copying it over the wire, which is clearly undesirable.
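
As a rough illustration of the --files-from pairing, the sketch below 
takes a list of changed paths and builds the rsync command line.  The 
input list stands in for whatever a snapshot-diff tool emits; its format 
here (one path per line, relative to the source root) is an assumption, 
and build_rsync_command is a hypothetical helper, not part of any 
Panasas tool.

```python
import os
import tempfile


def build_rsync_command(changed_paths, src_root, dest):
    """Write changed_paths to a list file and return (argv, list_path).

    Paths are assumed to be relative to src_root, which is how rsync
    interprets the entries read via --files-from.
    """
    fd, list_path = tempfile.mkstemp(prefix="changed-", suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for path in changed_paths:
            f.write(path + "\n")
    # -a preserves metadata; with --files-from it does not imply recursion,
    # so only the listed entries are considered.
    argv = ["rsync", "-a", f"--files-from={list_path}", src_root, dest]
    return argv, list_path
```

The returned argv can be handed to subprocess.run(argv, check=True).  
Only listed paths are transferred, so deletions and renames would still 
need separate handling.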

Outside of leveraging filesystem-specific features like that, which GPFS 
may or may not offer, I don't have any better suggestions for you.  But 
I do suspect you're I/O-bound here and md5 itself is not the problem.



Ellis H. Wilson III, Ph.D.
