[Beowulf] strange problem with large file moving between server
alex.chekholko at gmail.com
Sun Sep 21 10:00:20 PDT 2014
Sounds like a "typical" but very uncommon silent data corruption problem.
If you have another copy of the data, compare to that? If you don't have
another copy, accept the fact that some of your data may have been silently
corrupted.
Most RAID controllers do periodic "scrubbing"; was your Infortrend doing
that?
For the new system, consider using ZFS pointed at plain disks, as it may
have more layers of checksums compared to your current system.
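If you do have a known-good copy, comparing checksums block by block shows where the two files diverge, not only that they differ. A minimal sketch in Python, assuming both copies are readable from one machine; the function names and the 64 MiB block size are illustrative choices, not anything from your setup:

```python
import hashlib
from itertools import zip_longest

# Hypothetical block size; smaller blocks localise damage more precisely.
CHUNK_SIZE = 64 * 1024 * 1024


def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Yield (offset, md5 hex digest) for each consecutive block of a file."""
    offset = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk_size)
            if not block:
                break
            yield offset, hashlib.md5(block).hexdigest()
            offset += len(block)


def mismatched_offsets(good_path, suspect_path, chunk_size=CHUNK_SIZE):
    """Return the byte offsets of blocks that differ between the two files.

    A block that exists in only one file (length mismatch) counts as differing.
    """
    bad = []
    pairs = zip_longest(
        chunk_digests(good_path, chunk_size),
        chunk_digests(suspect_path, chunk_size),
    )
    for good, suspect in pairs:
        if good is None or suspect is None or good[1] != suspect[1]:
            bad.append((good or suspect)[0])
    return bad
```

Running it several times over the same pair of files may also be informative: if the list of bad offsets changes between runs, the corruption is probably happening on the read path rather than sitting on disk. Note that a file that was just written may be read back from the page cache rather than from the array.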
On Sunday, September 21, 2014, Jörg Saßmannshausen <
j.sassmannshausen at ucl.ac.uk> wrote:
> Dear all,
> I have a rather strange problem with one of my file servers, which I recently
> upgraded in order to accommodate more disc space.
> The problem: I copied the files from the old file space to a temporary disc
> storage space using this rsync command:
> rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> I have been doing this for some years now and never had any problems.
> As always, I ran md5sum afterwards to be sure there is no problem
> later and the user is not losing data. This time, for a rather large file
> (around 16 GB), the md5sum failed after I moved the files from the temp space
> back to the new destination using the same command as above.
> Still having access to the old file space, I decided to move this file again
> from the old file space. Strangely enough, rsync did not sync the file again,
> so I had to delete the file. Even after deleting the file and re-syncing it
> from the old source, the md5sum is wrong.
> Copying the file to a different file space did not cause this problem, i.e.
> the md5sum is correct.
> As it is a tar.gz file, I simply decided to decompress the original file on
> the different file server. That worked. The file where the md5sum is wrong did
> decompress on the different file server but crashed with an error message when
> I executed gunzip. So the file is broken.
> The setup:
> Originally I was using an old Infortrend box which had old PATA discs in it.
> This box is connected via SCSI to a frontend server which exports the file
> space via iSCSI. The backend for that, i.e. the one the user is accessing, is
> on a different physical machine and it is a Xen guest. The reason behind this
> setup is that the frontend is acting as a backup server and I don't want
> people to have access to it.
> I then exchanged the Infortrend box with a more recent model which has SATA
> capabilities but still has a SCSI connection to the frontend. The frontend is
> the same. I got a new controller for that box as the old one was broken.
> There are no changes in the backend; that is still the same Xen guest on the
> same hardware.
> What I cannot work out is why the old Infortrend box does not have any
> problems with the new file, while the newer one has a problem here. Also, when
> I copied over some files (again using the rsync command above), a few files
> did not copy correctly (again per md5sum) in the first instance but did so
> later.
> I find that highly alarming, as it means that at least for larger and/or
> binary files there seems to be a problem. However, I am not sure where to
> look at it as I am out of ideas.
> Could it be there is a problem with the 'new' controller?
> In all cases I was using ext4 as a file system and I did not have any
> problems with that.
> Anybody got some sentiments here?
> All the best from a sunny London
> P.S. To make things worse, I am off on a work-related trip from Monday,
> and I have been working on that problem since Friday evening.
> Dr. Jörg Saßmannshausen, MRSC
> University College London
> Department of Chemistry
> Gordon Street
> WC1H 0AJ
> web: http://sassy.formativ.net
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html