[Beowulf] strange problem with large file moving between server
j.sassmannshausen at ucl.ac.uk
Thu Oct 2 15:17:43 PDT 2014
Thanks for the feedback.
I can rule out the front end as I was using that with a different disc array
without any problems. So I am somewhat confident that the front end and the
controller are ok.
As for the disc array: I got a new controller for it, so one would assume that
it is working OK. I am in touch with the manufacturer to see if there is a
problem with it.
I have done some stress testing by copying the files over from the old
server to the new server, and I did not see any problems when I was using
a test board, i.e. a different front end with a different controller.
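A copy-and-verify stress test of this kind could be scripted roughly as follows. This is only a minimal sketch in Python, not the procedure actually used here: the function names `md5_of` and `copy_and_verify`, the target file name pattern and the number of rounds are made up for illustration. Note that a freshly copied file may be read back from the page cache rather than from the disc array, so the test file should be larger than the RAM of the machine (or the cache dropped in between) for the check to mean much.

```python
import hashlib
import os
import shutil


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, dest_dir, rounds=5):
    """Copy source into dest_dir several times.

    Returns the list of round numbers whose copy had a different MD5
    digest than the source (an empty list means every copy matched).
    """
    expected = md5_of(source)
    failures = []
    for round_no in range(1, rounds + 1):
        # Hypothetical file name pattern, chosen only for this sketch.
        target = os.path.join(dest_dir, "stress_copy_%d" % round_no)
        shutil.copyfile(source, target)
        if md5_of(target) != expected:
            failures.append(round_no)
        os.remove(target)
    return failures
```

Calling `copy_and_verify("/path/to/bigfile", "/path/on/new/array", rounds=20)` (both paths are placeholders) would return the rounds in which the checksum did not match.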
Having said that: I cannot really rule out that the controller I am currently
using might have a problem. It is a dual controller (two SCSI connections),
and one of the boxes which was connected to it had a slower transfer rate.
What I do not know is whether the controller then steps down, so that any
problems were masked by the slower transfer rate.
Unfortunately, as so often, the hardware is in use and needed, so I cannot
take it offline too often and hamper people's work.
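To gather more evidence on whether a controller or memory is corrupting data, a known-good copy and a broken copy of the same file can be compared block by block to see where the differences sit and how large they are. A rough sketch in Python, assuming both copies are readable from one machine; the function name `differing_blocks` and the 4096-byte block size are illustrative assumptions:

```python
def differing_blocks(good_path, bad_path, block_size=4096):
    """Compare two files block by block.

    Returns a list of (offset, differing_bytes) tuples, one for each
    block in which the two files are not identical. A length mismatch
    within a block counts as additional differing bytes.
    """
    mismatches = []
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            block_a = good.read(block_size)
            block_b = bad.read(block_size)
            if not block_a and not block_b:
                break  # both files are exhausted
            if block_a != block_b:
                changed = sum(1 for x, y in zip(block_a, block_b) if x != y)
                changed += abs(len(block_a) - len(block_b))
                mismatches.append((offset, changed))
            offset += block_size
    return mismatches
```

The pattern of offsets might give a hint: a handful of isolated bytes would look different from whole blocks of garbage, although neither proves where the fault is.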
Talking about memtest: which one do you suggest? memtest or memtester? I have
heard different opinions about them.
All the best from a mild London
On Thursday 02 October 2014 you wrote:
> RAM somewhere could also be faulty. Have a look at the logs for any ECC
> errors (both system memory and RAID controller) and memtest the boxes
> involved for a couple of days. I would suggest some stress testing of the
> new server if not done already.
> Best regards,
> On Sun, Sep 21, 2014 at 3:22 PM, Jörg Saßmannshausen <
> j.sassmannshausen at ucl.ac.uk> wrote:
> > Dear all,
> > I have a rather strange problem with one of my file servers, which I
> > recently upgraded in order to accommodate more disc space.
> > The problem: I copied the files from the old file space to a temporary
> > disc storage space using this rsync command:
> > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> > I have been doing this for some years now and never had any problems.
> > As always, I ran md5sum afterwards to make sure that there is no problem
> > later and the user does not lose data. This time, the md5sum failed for a
> > rather large file (around 16 GB) after I moved the files from the temp
> > space back to the new destination using the same command as above.
> > Still having access to the old file space, I decided to move this file
> > again from the old file space. Strangely enough, rsync did not sync the
> > file again, so I had to delete the file. Even after deleting the file and
> > re-syncing it from the old source, the md5sum is wrong.
> > Copying the file to a different file space did not cause this problem,
> > i.e. the md5sum is correct.
> > As it is a tar.gz file, I simply decided to decompress the original file
> > on the different file server. That worked. The file where the md5sum is
> > wrong did not decompress on the different file server; gunzip failed
> > with an error message. So the file is broken.
> > The setup:
> > Originally I was using an old Infortrend box which had old PATA discs in
> > it. This box is connected via SCSI to a frontend server which exports the
> > file space via iSCSI. The backend for that, i.e. the one the user is
> > accessing, is on a different physical machine and is a Xen guest. The
> > reason for that setup is that the frontend acts as a backup server and I
> > don't want people to have access to it.
> > I then exchanged the Infortrend box for a more recent model which has
> > SATA capabilities but still has a SCSI connection to the frontend. The
> > frontend is the same. I got a new controller for that box as the old one
> > was broken. There are no changes in the backend; that is still the same
> > Xen guest on the same hardware.
> > What I cannot work out is why the old Infortrend box does not have any
> > problems with the new file, while the newer one does. Also, when I copied
> > over some files (again using the rsync command above), a few files did
> > not copy correctly (again according to md5sum) in the first instance but
> > did so later.
> > I find that highly alarming, as it means that at least for larger and/or
> > some binary files there seems to be a problem. However, I am not sure
> > where to look, as I am out of ideas.
> > Could it be there is a problem with the 'new' controller?
> > In all cases I was using ext4 as the file system and I did not have any
> > problems with that.
> > Does anybody have any thoughts on this?
> > All the best from a sunny London
> > Jörg
> > P.S. To make things worse, I am off on a work-related trip from Monday
> > onwards, and I have been working on that problem since Friday evening.
> > --
> > *************************************************************
> > Dr. Jörg Saßmannshausen, MRSC
> > University College London
> > Department of Chemistry
> > Gordon Street
> > London
> > WC1H 0AJ
> > email: j.sassmannshausen at ucl.ac.uk
> > web: http://sassy.formativ.net
> > Please avoid sending me Word or PowerPoint attachments.
> > See http://www.gnu.org/philosophy/no-word-attachments.html
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
email: j.sassmannshausen at ucl.ac.uk
Please avoid sending me Word or PowerPoint attachments.