[Beowulf] Rsync - checksums
Ellis H. Wilson III
ellis at ellisv3.com
Mon Jun 17 08:29:16 PDT 2019
On 6/17/19 11:12 AM, bill at princeton.edu wrote:
> It's not a GPFS issue per se. The changelog isn't quite there right now
> but will be. Today the question only is about rsync performance.
Hi Bill,
md5 is a reasonably efficient checksum. It's when you start getting into
SHA-1 and SHA-256 that things start to get a little more expensive. I
would be really surprised if you were CPU-bound rather than I/O-bound
for this.
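
If you want to sanity-check the CPU side yourself, here is a minimal
sketch that measures in-memory hashing throughput with Python's hashlib.
The function name and the buffer sizes are my own choices for
illustration; compare the numbers it prints against what your storage can
actually deliver per stream.

```python
import hashlib
import time


def hash_throughput_mb_s(algorithm, total_mb=64, chunk_kb=128):
    """Return approximate MB/s for hashing in-memory data with `algorithm`.

    No disk I/O is involved, so this is an upper bound on what a single
    core can checksum, not a measure of end-to-end rsync speed.
    """
    chunk = b"\0" * (chunk_kb * 1024)
    n_chunks = (total_mb * 1024) // chunk_kb
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(n_chunks):
        hasher.update(chunk)
    hasher.digest()
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    for name in ("md5", "sha1", "sha256"):
        print(f"{name}: {hash_throughput_mb_s(name):.0f} MB/s")
```

If the md5 figure comes out well above your per-stream read bandwidth,
that points at I/O rather than the checksum.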
By default rsync does not operate using checksums: it compares file size
and modification time, and therefore does not need to read in each file
in its entirety to see if it should be updated. Do you have a strong
reason for using the --checksum option? Users typically have to try
pretty hard to do things that circumvent that default heuristic.
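
To make the difference concrete, here is a rough Python approximation of
the two tests. It is a sketch of the idea, not rsync's actual code: the
function names are hypothetical, and I'm using MD5 for the full-content
comparison as a stand-in for whatever digest your rsync build uses.

```python
import hashlib
import os


def quick_check_differs(src, dst):
    """Approximate the default test: metadata only, no file contents read."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def _md5_of(path, chunk_size=1 << 20):
    """Read the whole file in chunks and return its MD5 hex digest."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def checksum_differs(src, dst):
    """Approximate --checksum: every byte of both files must be read."""
    if not os.path.exists(dst):
        return True
    if os.path.getsize(src) != os.path.getsize(dst):
        return True
    return _md5_of(src) != _md5_of(dst)
```

The only case where the two disagree is a file whose contents changed
while its size and mtime stayed the same, which is the "try pretty hard"
case above. The price of catching it is reading every file on both sides.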
If you need guaranteed DR even in the face of a file that falls outside
of the typical rsync heuristic, you're best served by leveraging some
part of the underlying filesystem's feature set to achieve this. It's
the only one that's going to be able to trivially compute what changed
and track that. In my day job at Panasas we designed pan_snap_delta
explicitly for this -- to be able to efficiently emit a succinct list of
files and directories which have changed in any way between snapshots,
and our customers have used that paired with the rsync --files-from
option to great effect. Two summers back I added another utility to
that mix, pan_snap_replicator, which could figure out exactly how a file
or folder had changed, which ends up being crucial for situations like
"we moved our 1PB directory of stuff from /a to /b." rsync will
typically cope with this by deleting the entire dir on the remote side
and recopying it over the wire, which is clearly undesirable.
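
For the --files-from pairing, here is a minimal sketch of the glue,
assuming you already have a list of changed paths from some
filesystem-level tool. I'm not reproducing pan_snap_delta's output format
here; the input list, the function name, and the paths in the usage note
are all hypothetical.

```python
import subprocess


def build_rsync_command(changed_paths, src_root, dest, list_path):
    """Write changed paths to a list file and return the rsync argv.

    Paths handed to --files-from are interpreted relative to the source
    directory, so leading slashes are stripped here.
    """
    with open(list_path, "w") as f:
        for path in changed_paths:
            f.write(path.lstrip("/") + "\n")
    return ["rsync", "-a", f"--files-from={list_path}", src_root, dest]


def run_sync(changed_paths, src_root, dest, list_path):
    """Run the transfer; raises CalledProcessError if rsync fails."""
    cmd = build_rsync_command(changed_paths, src_root, dest, list_path)
    subprocess.run(cmd, check=True)
```

Something like run_sync(["proj/a.dat"], "/gpfs/data/", "drhost:/backup/",
"/tmp/changed.txt") would then transfer only the listed files. Note that
this only covers files that still exist on the source; deletions and
renames need separate handling.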
Outside of leveraging filesystem-specific features like that, which GPFS
may or may not offer, I don't have any better suggestions for you. But
I do suspect you're I/O-bound here and md5 itself is not the problem.
Best,
ellis
--
Ellis H. Wilson III, Ph.D.
www.ellisv3.com