[Beowulf] Rsync - checksums

Bill Wichser bill at princeton.edu
Tue Jun 18 08:00:16 PDT 2019


Well thanks for THAT pointer!  Using --checksum-choice=none results in 
a speedup of somewhere between 2x and 3x.  That validates the checksum 
theory that everything has been pointing towards.  Now to get xxhash 
into rsync and I think we are all set.
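
For anyone wanting to sanity-check the same theory on their own hardware, 
here is a rough sketch that measures how fast a single core can push data 
through a hash.  It is only an illustration: throughput_mb_s is a made-up 
helper name, the algorithms are whatever Python's hashlib ships (md4 is 
often unavailable there and xxHash is not in the standard library, so both 
are left out), and hashing a buffer in memory ignores disk and network 
entirely.

```python
import hashlib
import time


def throughput_mb_s(algorithm: str, size_mb: int = 32) -> float:
    """Hash size_mb MiB of in-memory data and return the rate in MiB/s.

    Hypothetical helper for illustration; not part of rsync.
    """
    chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros, reused for every update
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(size_mb):
        h.update(chunk)
    h.digest()
    # Guard against a zero interval on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size_mb / elapsed


if __name__ == "__main__":
    for name in ("md5", "sha1", "blake2b"):
        print(f"{name}: {throughput_mb_s(name):.0f} MiB/s")
```

If the md5 figure comes out well below what the link can carry, that points 
the same way as the 2x to 3x result above: the checksum, not the pipe, is 
the limit.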

Thanks,
Bill

On 6/18/19 9:57 AM, Ellis H. Wilson III wrote:
> On 6/18/19 9:16 AM, Bill Wichser wrote:
>> Stock RH 7 version, rsync-3.1.2-6.el7_6.1.x86_64.  We've tried a 
>> number of recompiles, with both gcc and Intel.  The only difference 
>> between otherwise identical compiles was md4 vs md5.
>>
>> /bin/rsync -lptgoDAH -v --numeric-ids -d --relative --delete 
>> --delete-after --files-from=...
>>
>> I'm not asking for help.  Just asking whether anyone has attempted to 
>> swap the algorithm for something much faster.
>>
>> I refer you to this project https://cyan4973.github.io/xxHash/ where 
>> there is a table of speeds.  Regardless of what anyone might 
>> speculate, we are pursuing this route of changing out the algorithm.  
>> Maybe it's all for naught.  Maybe it isn't.  But in a few weeks 
>> hopefully we'll have an answer.
> 
> Very interesting.  From the rsync man page:
> 
> "Note that rsync always verifies that each transferred file was 
> correctly reconstructed on the receiving side by checking a 
> whole-file checksum that is generated as the file is transferred, but 
> that automatic after-the-transfer verification has nothing to do with 
> this option’s before-the-transfer "Does this file need to be updated?" 
> check."
> 
> So it sounds like you have sufficient churn in large files that the 
> checksum validation post-transfer is your bottleneck.  Short of hacking 
> rsync to use a faster algorithm, your remaining choice is to set 
> --checksum-choice=none and then perform your own hashing out-of-band 
> to check the transferred data, using the list you have provided via 
> --files-from.  This will nerf rsync's ability to do 
> delta-transfer, which may be ok depending on the nature of your churning 
> files.  If your pipes are huge (atypical for DR), your CPU is weak, and 
> your churning data is mostly completely new or completely changed files, 
> --checksum-choice=none may work very well for you.
> 
> Best,
> 
> ellis
> 
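
The out-of-band hashing suggested above could look roughly like the sketch 
below.  It is a minimal illustration under some assumptions, not anything 
rsync provides: hash_file, build_manifest and differing are made-up names, 
the list file is assumed to hold one path per line relative to the transfer 
root (as with --files-from), and md5 is only a default that can be swapped 
for any algorithm hashlib knows.  In practice the manifest would be built 
separately on each host and the two results compared.

```python
import hashlib
import os


def hash_file(path, algorithm="md5", bufsize=1024 * 1024):
    """Return the hex digest of one file, read in bufsize chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(list_path, root, algorithm="md5"):
    """Map each relative path named in list_path to its digest under root.

    Entries that are not regular files (directories, symlinks, missing
    paths) are skipped, so they simply do not appear in the manifest.
    """
    manifest = {}
    with open(list_path) as listing:
        for line in listing:
            rel = line.rstrip("\n")
            if not rel:
                continue
            full = os.path.join(root, rel.lstrip("/"))
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[rel] = hash_file(full, algorithm)
    return manifest


def differing(src_manifest, dst_manifest):
    """Return source paths that are missing or different on the destination."""
    return sorted(
        path for path, digest in src_manifest.items()
        if dst_manifest.get(path) != digest
    )
```

Anything returned by differing() is a candidate for re-transfer.  Note that 
this only moves the hashing cost off the transfer path; it does not make 
the hash itself any cheaper.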
