[Beowulf] strange problem with large file moving between server
j.sassmannshausen at ucl.ac.uk
Sun Sep 21 10:17:43 PDT 2014
Thanks for the feedback.
I still have the original data, so that is not a problem right now. What
worries me is this: even if I restore the data now, can I trust the system?
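One way to narrow this down, since the original data still exists, is to compare a known-good copy with the suspect copy chunk by chunk rather than with a single md5sum over the whole file. The following is only a minimal sketch under stated assumptions: the function names and the 64 MiB chunk size are made up for illustration, and both paths must be readable from the machine running it. Note also that a freshly written file may be read back from the page cache rather than from the disc, so a match right after copying proves less than a match after a remount or reboot.

```python
import hashlib
from itertools import zip_longest


def chunk_digests(path, chunk_size=64 * 1024 * 1024):
    """Yield one MD5 hex digest per chunk_size-byte chunk of the file."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield hashlib.md5(chunk).hexdigest()


def differing_chunks(good_path, suspect_path, chunk_size=64 * 1024 * 1024):
    """Return the indices of chunks whose digests differ between two files.

    If one file is shorter, its missing chunks count as differing.
    """
    pairs = zip_longest(
        chunk_digests(good_path, chunk_size),
        chunk_digests(suspect_path, chunk_size),
    )
    return [index for index, (good, suspect) in enumerate(pairs) if good != suspect]
```

An empty list means the two copies agree. A non-empty list gives the chunk indices to look at, which may help to tell a single damaged region from corruption scattered over the whole file.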
It is a RAID5 I am using and the discs are new. I formatted the disc space
on Thursday, so the file system is new as well.
What I found in syslog on the front end is this:
mptbase: ioc0: LogInfo(0x11080000): F/W: Outbound DMA Overrun
And I get that a few times. So either the controller on the front end has a
problem, which I did not see with the older Infortrend box as it is slower and
hence the controller is less active, or the controller in the Infortrend box
has a problem.
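To see how often the message appears, and whether it lines up with the failed copies, the log could be counted with something like the sketch below. It is an illustration only: the regular expression is modelled on the single line quoted above, other firmware messages may be formatted differently, and the function name is made up.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted above (an assumption);
# it captures the controller name, the LogInfo code and the message text.
MPT_LOGINFO = re.compile(
    r"mptbase: (?P<ioc>ioc\d+): LogInfo\((?P<code>0x[0-9a-fA-F]+)\): (?P<text>.*)"
)


def count_loginfo(lines):
    """Count mptbase LogInfo messages, grouped by (controller, code, text)."""
    counts = Counter()
    for line in lines:
        match = MPT_LOGINFO.search(line)
        if match:
            key = (match["ioc"], match["code"], match["text"].strip())
            counts[key] += 1
    return counts
```

It takes any iterable of lines, for example an open log file. Where syslog lives depends on the distribution, so the path is left to the reader.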
I don't know whether the Infortrend box does scrubbing. I have not activated
anything here and I am just using the standard settings.
Regarding ZFS: is that available for Linux now? I have lost track a bit here.
All the best from London
On Sunday 21 September 2014 you wrote:
> Hi Jörg,
> Sounds like a "typical" but very uncommon silent data corruption problem.
> If you have another copy of the data, compare to that? If you don't have
> another copy, accept the fact that some of your data may have been silently
> corrupted.
> Most RAID controllers do periodic "scrubbing"; was your Infortrend doing
> that?
> For the new system, consider using ZFS pointed at plain disks, as it may
> have more layers of checksums compared to your current system.
> On Sunday, September 21, 2014, Jörg Saßmannshausen <
> j.sassmannshausen at ucl.ac.uk> wrote:
> > Dear all,
> > I have a rather strange problem with one of my file servers, which I
> > recently upgraded in order to accommodate more disc space.
> > The problem: I copied the files from the old file space to a temporary
> > disc storage space using this rsync command:
> > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> > I have been doing this for some years now and never had any problems.
> > As always, I am running md5sum afterwards to be sure there is not a
> > problem later and the user is not losing data. This time around, for a
> > rather large file (around 16 GB), the md5sum failed after I moved the
> > files from the temp space back to the new destination using the same
> > command as above.
> > Still having access to the old file space, I decided to copy this file
> > again from the old file space. Strangely enough, rsync did not sync the
> > file again, so I had to delete the file. Even after deleting the file and
> > re-syncing it from the old source, the md5sum is wrong.
> > Copying the file to a different file space did not cause this problem,
> > i.e. the md5sum is correct.
> > As it is a tar.gz file, I simply decided to decompress the original file
> > on the different file server. That worked. The file with the wrong md5sum
> > did not decompress on the different file server; gunzip aborted with an
> > error message. So the file is broken.
> > The setup:
> > Originally I was using an old Infortrend box which had old PATA discs in
> > it. This box is connected via SCSI to a frontend server which exports the
> > file space via iSCSI. The backend for that, i.e. the one the user is
> > accessing, is on a different physical machine and it is a XEN guest. The
> > reason behind that setup is that the frontend is acting as a backup
> > server and I don't want people to have access to it.
> > I then exchanged the Infortrend box for a more recent model which has
> > SATA capabilities but still has a SCSI connection to the frontend. The
> > frontend is the same. I got a new controller for that box as the old one
> > was broken. There are no changes in the backend; that is still the same
> > XEN guest on the same hardware.
> > What I cannot work out is why the old Infortrend box does not have any
> > problems with the new file, whereas the newer one does. Also, when I
> > copied over some files (again using the rsync command above), a few files
> > did not copy correctly (again according to md5sum) in the first instance
> > but did so later.
> > I find that highly alarming, as it means that at least for larger and/or
> > some binary files there seems to be a problem. However, I am not sure
> > where to look, as I am out of ideas.
> > Could it be there is a problem with the 'new' controller?
> > In all cases I was using ext4 as a file system and I did not have any
> > problems with that.
> > Anybody got some sentiments here?
> > All the best from a sunny London
> > Jörg
> > P.S. To make things worse, I am off on a work-related trip from Monday
> > onwards and I have been working on that problem since Friday evening.
> > --
> > *************************************************************
> > Dr. Jörg Saßmannshausen, MRSC
> > University College London
> > Department of Chemistry
> > Gordon Street
> > London
> > WC1H 0AJ
> > web: http://sassy.formativ.net
> > Please avoid sending me Word or PowerPoint attachments.
> > See http://www.gnu.org/philosophy/no-word-attachments.html
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
email: j.sassmannshausen at ucl.ac.uk
Please avoid sending me Word or PowerPoint attachments.