[Beowulf] strange problem with large file moving between server

Dimitris Zilaskos dimitrisz at gmail.com
Fri Oct 3 02:28:40 PDT 2014


Hi,

memtest86+ from memtest.org will detect most common memory issues - though
it may need to run for a couple of days. Since everything used to work
fine, maybe it is a good idea to focus on the new hardware. It is not
unusual for brand new equipment to be faulty.

Cheers,

Dimitris



On Thu, Oct 2, 2014 at 11:17 PM, Jörg Saßmannshausen <
j.sassmannshausen at ucl.ac.uk> wrote:

> Hi Dimitris
>
> thanks for the feedback.
>
> I can rule out the front end as I was using that with a different disc array
> without any problems. So I am somewhat confident that the front end and the
> controller are ok.
>
> As for the disc array: I got a new controller here so one would assume that
> one is working ok. I am in touch with the manufacturer to see if there is a
> problem with that.
>
> I did some stress testing by copying the files over from the old server to
> the new server, and I did not see any problems when I was using a test board,
> i.e. a different front end with a different controller.
>
> Having said that: I cannot really rule out that the controller I am currently
> using has a problem. It is a dual controller (two SCSI connections), and one
> of the boxes which was connected to it had a slower transfer rate. What I do
> not know is whether the controller then steps down and hence any problems
> are masked by the slower transfer rate.
>
> Unfortunately, like so often, the hardware is in use and needed so I cannot
> take it offline too often and then hamper people's work.
>
> Talking about memtest: which one do you suggest? memtest or memtester? I have
> heard different opinions about them.
>
> All the best from a mild London
>
> Jörg
>
> On Thursday 02 October 2014 you wrote:
> > Hello,
> >
> > RAM somewhere could also be faulty. Have a look at the logs for any ECC
> > errors (both system memory and RAID controller) and memtest the boxes
> > involved for a couple of days. I would suggest some stress testing of the
> > new server if not done already.
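
As a rough illustration of the "look for ECC errors" step above, here is a minimal sketch that reads the corrected/uncorrected error counters the Linux EDAC subsystem exposes under /sys/devices/system/edac/mc. It assumes a Linux host with an EDAC driver loaded for the memory controller; the function name read_edac_counts is made up for this example. It only covers system memory, not the cache memory on a RAID controller, which would need the vendor's own tools.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    An empty dict means no EDAC memory controllers were found, e.g.
    because no EDAC driver is loaded or the RAM is not ECC.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Counter missing or unreadable: skip this controller.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Non-zero counters, in particular uncorrected ones, would be a reason to look at the RAM of that box first.
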
> >
> > Best regards,
> >
> > Dimitris
> >
> >
> >
> > On Sun, Sep 21, 2014 at 3:22 PM, Jörg Saßmannshausen <
> > j.sassmannshausen at ucl.ac.uk> wrote:
> > > Dear all,
> > >
> > > I have a rather strange problem with one of my file servers, which I
> > > recently upgraded in order to accommodate more disc space.
> > >
> > > The problem: I copied the files from the old file space to a temporary
> > > disc storage space using this rsync command:
> > >
> > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> > >
> > > I have been doing this for some years now and never had any problems.
> > >
> > > As always, I run md5sum afterwards to be sure there is no problem later
> > > and the user is not losing data. This time around, the md5sum of a rather
> > > large file (around 16 GB) failed after I moved the files from the temp
> > > space back to the new destination using the same command as above.
> > >
> > > Still having access to the old file space, I decided to move this file
> > > from the old file space. Strangely enough, rsync did not sync the file
> > > again, so I had to delete the file. Even after deleting the file and
> > > re-syncing it from the old source, the md5sum is wrong.
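
One possible explanation for rsync not re-syncing the broken file (an assumption, nobody in the thread diagnosed it): by default rsync skips a file whose size and modification time match on both sides, and the -t option in the command above preserves the modification time. A corrupted copy of the same size therefore looks up to date unless -c / --checksum is given. The sketch below only approximates that "quick check" to show the effect; quick_check_same is a made-up helper, not rsync code.

```python
import os


def quick_check_same(src, dst):
    """Approximate rsync's default quick check: same size, same mtime.

    File contents are deliberately NOT compared, which is the point.
    """
    try:
        s, d = os.stat(src), os.stat(dst)
    except FileNotFoundError:
        return False
    # Whole seconds only; real rsync may compare at finer granularity.
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.bin")
        bad = os.path.join(tmp, "bad.bin")
        with open(good, "wb") as f:
            f.write(b"A" * 1024)
        with open(bad, "wb") as f:
            f.write(b"A" * 1023 + b"B")  # same size, one byte differs
        os.utime(bad, (1400000000, 1400000000))
        os.utime(good, (1400000000, 1400000000))
        print("looks identical to the quick check:", quick_check_same(good, bad))
```

If that is what happened, adding -c to the rsync command should make it pick up the broken file without deleting it first, at the cost of reading every file in full on both sides.
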
> > >
> > > Copying the file to a different file space did not cause these problems,
> > > i.e. the md5sum is correct.
> > > As it is a tar.gz file, I simply decided to decompress the original file
> > > on the different file server. That worked. The file with the wrong md5sum
> > > did not decompress on the different file server; gunzip crashed with an
> > > error message when I executed it. So the file is broken.
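
Since a good and a broken copy of the same file both exist, it may help to find out where they differ rather than only that they differ. The following is a minimal sketch, assuming both copies are readable from one machine; differing_blocks and the 1 MiB block size are choices made for this example only.

```python
def differing_blocks(path_a, path_b, block_size=1024 * 1024):
    """Yield the byte offset of every block in which the two files differ.

    If one file is shorter, the missing tail counts as differing.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                yield offset
            offset += block_size


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 3:
        sys.exit("usage: python blockdiff.py GOOD_COPY BROKEN_COPY")
    for off in differing_blocks(sys.argv[1], sys.argv[2]):
        print(f"blocks differ at offset {off}")
```

Whether the damage is a single block, many scattered blocks, or always at similar offsets on repeated copies might hint at where the corruption is introduced, though that interpretation is speculative.
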
> > >
> > > The setup:
> > >
> > > Originally I was using an old Infortrend box which had old PATA discs in
> > > it. This box is connected via SCSI to a frontend server which exports the
> > > file space via iSCSI. The backend for that, i.e. the one the user is
> > > accessing, is on a different physical machine and is a Xen guest. The
> > > reason behind that setup is that the frontend is acting as a backup
> > > server and I don't want people to have access to it.
> > > I then exchanged the Infortrend box with a more recent model which has
> > > SATA capabilities but still has a SCSI connection to the frontend. The
> > > frontend is the same. I got a new controller for that box as the old one
> > > was broken.
> > > There are no changes in the backend; that is still the same Xen guest on
> > > the same hardware.
> > >
> > > What I cannot work out is why the old Infortrend box does not have any
> > > problems with the new file, while the newer one does. Also, when I copied
> > > over some files (again using the rsync command above), a few files did
> > > not copy correctly (again according to md5sum) in the first instance but
> > > did so later.
> > >
> > > I find that highly alarming, as it means that at least for larger and/or
> > > some binary files there seems to be a problem. However, I am not sure
> > > where to look, as I am out of ideas.
> > >
> > > Could it be there is a problem with the 'new' controller?
> > > In all cases I was using ext4 as a file system and I did not have any
> > > problems with that.
> > >
> > > Anybody got some sentiments here?
> > >
> > > All the best from a sunny London
> > >
> > > Jörg
> > >
> > > P.S. To make things worse, I am off on a work-related trip from Monday
> > > onwards, and I have been working on that problem since Friday evening.
> > >
> > >
> > >
> > > --
> > > *************************************************************
> > > Dr. Jörg Saßmannshausen, MRSC
> > > University College London
> > > Department of Chemistry
> > > Gordon Street
> > > London
> > > WC1H 0AJ
> > >
> > > email: j.sassmannshausen at ucl.ac.uk
> > > web: http://sassy.formativ.net
> > >
> > > Please avoid sending me Word or PowerPoint attachments.
> > > See http://www.gnu.org/philosophy/no-word-attachments.html
> > >
> > > _______________________________________________
> > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > > To change your subscription (digest mode or unsubscribe) visit
> > > http://www.beowulf.org/mailman/listinfo/beowulf
>