From prentice.bisbal at rutgers.edu Tue Sep 2 08:20:00 2014
From: prentice.bisbal at rutgers.edu (Prentice Bisbal)
Date: Tue, 02 Sep 2014 11:20:00 -0400
Subject: [Beowulf] mpi slow pairs
In-Reply-To: <5403B042.4050100@unimelb.edu.au>
References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk>
 <20140829154957.GA6879@bx9.net> <5403B042.4050100@unimelb.edu.au>
Message-ID: <5405E020.7020303@rutgers.edu>

On 08/31/2014 07:31 PM, Christopher Samuel wrote:
> On 30/08/14 01:49, Greg Lindahl wrote:
>
>> Huh, Intel (PathScale/QLogic) has shipped a NxN debugging program for
>> more than a decade. The first vendor I recall shipping such a program
>> was Microway. I guess it takes a while for good practices to spread
>> throughout our community!
> "The first rule of Infiniband debugging is nobody talks about Infiniband
> debugging".
>
> Got a link for it please?
>

So true. I'd like to see a link, too.

--
Prentice

From prentice.bisbal at rutgers.edu Tue Sep 2 08:16:27 2014
From: prentice.bisbal at rutgers.edu (Prentice Bisbal)
Date: Tue, 02 Sep 2014 11:16:27 -0400
Subject: [Beowulf] mpi slow pairs
In-Reply-To:
References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk>
Message-ID: <5405DF4B.8020708@rutgers.edu>

On 08/29/2014 11:30 AM, Michael Di Domenico wrote:
> On Fri, Aug 29, 2014 at 9:32 AM, John Hearns wrote:
>> I would say the usual tool for that pair-wise comparison is Intel IMB
>> https://software.intel.com/en-us/articles/intel-mpi-benchmarks
>> I hope I have got your requirement correct!
> John,
>
> Close, but not exact. IMB will test ranks, but will not tell me if a
> specific pair of ranks is slower than others, only the collective of
> the ranks under test. what i'm looking for is an mpi version of this
>
> for x in node1->node100
>   for y in node1->node100
>     if x==y then skip
>     else mpirun -n 2 -npernode 1 -host $x,$y bwtest > $x$y.log
>
> unfortunately, the mpirun task takes about 3secs per iteration, and
> with 10k iterations, it's going to take a long time and i'm being
> impatient. i've been trying to write the mpi code myself, but my mpi
> is a little rusty so it's slow going...
>
>> Also have you run ibdiagnet to see if anything is flagged up?
> i've run a multitude of ib diags on the machines, but nothing is
> popping out as wrong. what's weird is that it's only certain pairings
> of machines, not any one machine in general.
>

I find most of the ibdiag* utilities to be of limited value when
debugging IB issues. Unfortunately, Mellanox's Unified Fabric Manager
(UFM) seems to be the only tool that's helpful for accurately monitoring
and identifying issues with IB networks. I've never used UFM myself, but
my friends at Princeton gave me a demo, and it seems like a fantastic
tool. Unfortunately, it's a commercial product, and probably only works
on Mellanox hardware (you don't mention whether you're using QLogic or
Mellanox hardware). The good news is, you can download it and evaluate
it. I'd give that a try, if I were you.

http://www.mellanox.com/page/products_dyn?product_family=100

--
Prentice
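A note on the pair-wise sweep discussed above: the N x N loop Michael sketches
can be batched into rounds of disjoint pairs, so the whole sweep needs only
N-1 rounds of concurrent mpirun launches instead of N*(N-1) serial runs. Below
is a minimal bash sketch using the standard round-robin (circle-method)
pairing; it assumes an even node count, hostnames node1..node100, and the same
hypothetical two-rank "bwtest" benchmark named in the loop above, so adjust to
your site.

    #!/bin/bash
    # Round-robin pairwise bandwidth sweep (sketch, not a drop-in tool).
    # Assumes: an even number of nodes, hostnames node1..node100, and a
    # two-rank point-to-point benchmark called "bwtest" on every node.
    nodes=( $(seq -f 'node%g' 1 100) )
    n=${#nodes[@]}                         # must be even for this pairing

    for (( r = 0; r < n - 1; r++ )); do
        # Circle method: the last node sits still, the others rotate, so
        # every round is a perfect matching and every pair occurs exactly once.
        pairs=( "${nodes[n-1]},${nodes[r]}" )
        for (( k = 1; k <= (n - 2) / 2; k++ )); do
            a=$(( (r + k) % (n - 1) ))
            b=$(( (r - k + n - 1) % (n - 1) ))
            pairs+=( "${nodes[a]},${nodes[b]}" )
        done

        # All pairs in a round are disjoint, so they can run concurrently
        # without sharing a node (and hence without skewing each other).
        for p in "${pairs[@]}"; do
            mpirun -n 2 -npernode 1 -host "$p" bwtest > "${p/,/_}.log" &
        done
        wait
    done

Sorting the resulting logs by measured bandwidth should make the slow pairings
stand out; the serial loop above remains handy for re-checking individual
suspects.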
From samuel at unimelb.edu.au Mon Sep 15 20:34:01 2014
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 16 Sep 2014 13:34:01 +1000
Subject: [Beowulf] Anyone else going to Slurm User Group in Lugano?
Message-ID: <5417AFA9.9040902@unimelb.edu.au>

Hi folks,

Anyone else going to the Slurm User Group next week in Lugano in
Switzerland?

cheers,
Chris
--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci

From samuel at unimelb.edu.au Fri Sep 19 11:09:25 2014
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Sat, 20 Sep 2014 04:09:25 +1000
Subject: [Beowulf] NUMA zone_reclaim_mode considered harmful?
Message-ID: <541C7155.2020506@unimelb.edu.au>

Hi folks,

Over on the xCAT mailing list I've been involved in a thread relating to
diverse settings of zone_reclaim_mode across nodes in clusters.

It starts here with Stuart's good description of the problem and diagnosis:

http://sourceforge.net/p/xcat/mailman/message/32841877/

I did some poking around on our systems and was able to confirm that
whilst our newer iDataplex (dx360 M4s) with Sandy Bridge CPUs all had
zone_reclaim_mode set to 0, our older iDataplex with Nehalems (dx360 M2s)
were all 1, along with an older SGI UV10 (Westmere).

The clincher was the fact that on that same cluster we had some IBM x3690
X5 with MAX5s which boot an identical diskless image, and they had
zone_reclaim_mode set to 0, not 1.

Turns out that this is indeed autotuned by older kernels, with this text
in the kernel Documentation/sysctl/vm.txt:

# zone_reclaim_mode is set during bootup to 1 if it is determined
# that pages from remote zones will cause a measurable performance
# reduction. The page allocator will then reclaim easily reusable
# pages (those page cache pages that are currently not used) before
# allocating off node pages.

However, in 3.16 a patch was committed that disabled this auto-tuning,
turning off zone reclamation by default.

It's probably worth checking your own x86-64 systems to see if this is
set for you and benchmarking with it disabled if it is.

Here's that patch with the description:

commit 4f9b16a64753d0bb607454347036dc997fd03b82
Author: Mel Gorman
Date:   Wed Jun 4 16:07:14 2014 -0700

    mm: disable zone_reclaim_mode by default

    When it was introduced, zone_reclaim_mode made sense as NUMA distances
    punished and workloads were generally partitioned to fit into a NUMA
    node. NUMA machines are now common but few of the workloads are
    NUMA-aware and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled but relatively few can identify the
    problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately so lets have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from
    local node instead of spilling over to other nodes. This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes. The NUMA penalties were
    sufficiently high to justify reclaiming the memory. On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this. Favour the
    common case and disable it by default. Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci
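For anyone who wants to check their own nodes as Chris suggests, a quick
sketch using the standard sysctl/procfs interfaces (run as root, and treat the
persistent setting as something to benchmark before committing to it):

    # Current value: 1 (or higher) means the kernel reclaims local pages
    # before it will allocate from a remote NUMA node.
    cat /proc/sys/vm/zone_reclaim_mode

    # Turn it off for a benchmarking run (takes effect immediately):
    sysctl -w vm.zone_reclaim_mode=0

    # Make it persistent across reboots if the benchmarks agree:
    echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf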
From j.sassmannshausen at ucl.ac.uk Sun Sep 21 07:22:21 2014
From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=)
Date: Sun, 21 Sep 2014 15:22:21 +0100
Subject: [Beowulf] strange problem with large file moving between server
Message-ID: <201409211522.25727.j.sassmannshausen@ucl.ac.uk>

Dear all,

I got a rather strange problem with one of my file servers which I have
recently upgraded in order to accommodate more disc space.

The problem: I have copied the files from the old file space to a
temporary disc storage space using this rsync command:

rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa

I have been doing this for some years now and never had any problems.

As always, I am running md5sum afterwards to be sure there is not a
problem later and the user is losing data. This time, for a rather large
file (around 16 GB), the md5sum failed after I moved the files from the
temp space back to the new destination using the same command as above.

As I still have access to the old file space, I decided to move this file
from the old file space again. Strangely enough, rsync does not sync the
file again, so I had to delete the file. Even after deleting the file and
re-syncing it from the old source, the md5sum is wrong.

Copying the file to a different file space did not cause this problem,
i.e. the md5sum is correct.
As it is a tar.gz file, I simply decided to decompress the original file
on the different file server. That worked. The file where the md5sum is
wrong did not decompress on the different file server but crashed with an
error message when I executed gunzip. So the file is broken.

The setup:

Originally I was using an old Infortrend box which had old PATA discs in
it. This box is connected via SCSI to a frontend server which exports the
file space via iSCSI. The backend for that, i.e. the one the user is
accessing, is on a different physical machine and it is a Xen guest. The
reason behind that setup is that the frontend is acting as a backup
server and I don't want people to have access to it.
I then exchanged the Infortrend box with a more recent model which has
SATA capabilities but still has a SCSI connection to the frontend. The
frontend is the same. I got a new controller for that box as the old one
was broken. There are no changes in the backend, that is still the same
Xen guest on the same hardware.

What I cannot work out is why the old Infortrend box does not have any
problems with the new file while the newer one has a problem here. Also,
when I copied over some files (again using the rsync command above), a
few files did not copy correctly (again md5sum) in the first instance but
did so later.

I find that highly alarming as that means that at least for larger and/or
some binary files there seems to be a problem. However, I am not sure
where to look as I am out of ideas.

Could it be there is a problem with the 'new' controller?
In all cases I was using ext4 as a file system and I did not have any
problems with that.

Anybody got some sentiments here?

All the best from a sunny London

Jörg

P.S. To make things worse I am off on a work-related trip from Monday
onwards and I have been working on that problem since Friday evening.

--
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ

email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: This is a digitally signed message part.
URL:
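One detail worth noting on the rsync behaviour described above: by default
rsync skips files whose size and modification time already match, which is why
a silently corrupted copy is not re-transferred on the next run. Forcing
checksum comparison, or verifying with md5sum lists, works around that. A
short sketch; the paths below are placeholders, and the rsync flags are the
ones from the command above with --checksum added:

    # Re-run the copy with --checksum so rsync compares file contents
    # rather than just size and modification time:
    rsync -vrltH -pgo --stats -D --numeric-ids -x --checksum oldserver:foo tempspace:baa

    # Or verify end to end: generate md5sums on the source tree and check
    # them on the destination (any line not ending in ': OK' is a bad file).
    ( cd /path/to/foo && find . -type f -exec md5sum {} + > /tmp/foo.md5 )
    ( cd /path/to/baa && md5sum -c /tmp/foo.md5 | grep -v ': OK$' )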
From alex.chekholko at gmail.com Sun Sep 21 10:00:20 2014
From: alex.chekholko at gmail.com (Alex Chekholko)
Date: Sun, 21 Sep 2014 10:00:20 -0700
Subject: [Beowulf] strange problem with large file moving between server
In-Reply-To: <201409211522.25727.j.sassmannshausen@ucl.ac.uk>
References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk>
Message-ID:

Hi Jörg,

Sounds like a "typical" but very uncommon silent data corruption problem.
If you have another copy of the data, compare to that? If you don't have
another copy, accept the fact that some of your data may have been
silently corrupted.

Most RAID controllers do periodic "scrubbing"; was your Infortrend doing
that?

For the new system, consider using ZFS pointed at plain disks, as it may
have more layers of checksums compared to your current system.

Regards,
Alex

On Sunday, September 21, 2014, Jörg Saßmannshausen <
j.sassmannshausen at ucl.ac.uk> wrote:

> Dear all,
>
> I got a rather strange problem with one of my file servers which I recently
> have upgraded in order to accommodate more disc space.
>
> The problem: I have copies the files from the old file space to a
> temporary disc
> storage space using this rsync command:
>
> rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
>
> I am doing this now for some years and never had any problems.
>
> As always, I am running md5sum afterwards to be sure ther is not a problem
> later and the user is loosing data. This time around a rather large file
> (around 16 GB) the md5sum failed after I moved the files from the temp
> space
> back to the new destination using the same command as above.
>
> Having still access to the old file space, I decided to move this file
> from the
> old file space. Strangely enough, rsync does not sync the file again so I
> had to
> delete the file. Even after deleting the file and re-sync it from the old
> source, the md5sum is wrong.
>
> Copying the file to a different file space did not cause these problem,
> i.e. the
> md5sum is correct.
> As it is a tar.gz file, I simply decided to decompress the original file
> on the
> different file server. That worked. The file where the md5sum is wrong did
> not
> decompress on the different file server but crashed with an error message
> when I
> executed gunzip. So the file is broken.
>
> The setup:
>
> Originally I was using an old Infortrand box which had old PATA discs in
> it.
> This box is connected via scsi to a frontend server which exports the file
> space via iscsi. The backend for that, i.e. the one the user is accessing
> is
> on a different physical machine and it is a XEN guest. The reason behind
> that
> setting is as the frontend is acting as a backup server and I don't want
> people to have access to it.
> I then exchanged the Infortrend box with a more recent model which got SATA > capeabilities but still got scsi connection to the frontend. The frontend > is > the same. I got a new controller for that box as the old one was broken. > There is no changes in the backend, that is still the same XEN guest on the > same hardware. > > What I cannot work out is why the old Infortrend box does not have any > problems with the new file, the newer one has a problem here. Also, when I > have > copied over some files (again using the rsync command above) a few files > did not > copy correctly (again md5sum) in the first instance but done so later. > > I find that highly alarming as that means that at least for larger and/or > some > binary files there seems to be a problem. However, I am not sure there to > look > at it as I am out of ideas. > > Could it be there is a problem with the 'new' controller? > In all cases I was using ext4 as a file system and I did not have any > problems > with that. > > Anybody got some sentiments here? > > All the best from a sunny London > > J?rg > > P.S. To make things worse I am off on a work related trip from Monday > onwards > and I am working on that problem since Friday evening. > > > > -- > ************************************************************* > Dr. J?rg Sa?mannshausen, MRSC > University College London > Department of Chemistry > Gordon Street > London > WC1H 0AJ > > email: j.sassmannshausen at ucl.ac.uk > web: http://sassy.formativ.net > > Please avoid sending me Word or PowerPoint attachments. > See http://www.gnu.org/philosophy/no-word-attachments.html > -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.sassmannshausen at ucl.ac.uk Sun Sep 21 10:17:43 2014 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Sun, 21 Sep 2014 18:17:43 +0100 Subject: [Beowulf] strange problem with large file moving between server In-Reply-To: References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk> Message-ID: <201409211817.47972.j.sassmannshausen@ucl.ac.uk> Hi Alex, thanks for the feedback. I still got the original data so that is not a problem right now. What worries me is even if I restore the data right now can I trust the system? It is a RAID5 I am using and the discs are new. I have formated the disc space on Thursday so the file system is new as wll. What I found on the front end is that in syslog: mptbase: ioc0: LogInfo(0x11080000): F/W: Outbound DMA Overrun And I get that a few times. So either the controller on the front end got a problem which I did not see with the older Infortrend box as it is slower and hence the controller is less active, or the controller at the Infortrend box got a problem. I don't know whether the Infortrend box does scrubbing. I have not activated something here and I am just using the standart settings. Regarding ZFS: is that available for Linux now? I lost a bit track here. All the best from London J?rg On Sonntag 21 September 2014 you wrote: > Hi J?rg, > > Sounds like a "typical" but very uncommon silent data corruption problem. > If you have another copy of the data, compare to that? If you don't have > another copy, accept the fact that some of your data maybe got silently > corrupted. > > Most RAID controllers do periodic "scrubbing"; was your Infortrend doing > that? > > For the new system, consider using ZFS pointed at plain disks, as it may > have more layers of checksums compared to your current system. 
> > Regards, > Alex > > On Sunday, September 21, 2014, J?rg Sa?mannshausen < > > j.sassmannshausen at ucl.ac.uk> wrote: > > Dear all, > > > > I got a rather strange problem with one of my file servers which I > > recently have upgraded in order to accommodate more disc space. > > > > The problem: I have copies the files from the old file space to a > > temporary disc > > storage space using this rsync command: > > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo > > tempspace:baa > > > > I am doing this now for some years and never had any problems. > > > > As always, I am running md5sum afterwards to be sure ther is not a > > problem later and the user is loosing data. This time around a rather > > large file (around 16 GB) the md5sum failed after I moved the files from > > the temp space > > back to the new destination using the same command as above. > > > > Having still access to the old file space, I decided to move this file > > from the > > old file space. Strangely enough, rsync does not sync the file again so I > > had to > > delete the file. Even after deleting the file and re-sync it from the old > > source, the md5sum is wrong. > > > > Copying the file to a different file space did not cause these problem, > > i.e. the > > md5sum is correct. > > As it is a tar.gz file, I simply decided to decompress the original file > > on the > > different file server. That worked. The file where the md5sum is wrong > > did not > > decompress on the different file server but crashed with an error message > > when I > > executed gunzip. So the file is broken. > > > > The setup: > > > > Originally I was using an old Infortrand box which had old PATA discs in > > it. > > This box is connected via scsi to a frontend server which exports the > > file space via iscsi. The backend for that, i.e. the one the user is > > accessing is > > on a different physical machine and it is a XEN guest. The reason behind > > that > > setting is as the frontend is acting as a backup server and I don't want > > people to have access to it. > > I then exchanged the Infortrend box with a more recent model which got > > SATA capeabilities but still got scsi connection to the frontend. The > > frontend is > > the same. I got a new controller for that box as the old one was broken. > > There is no changes in the backend, that is still the same XEN guest on > > the same hardware. > > > > What I cannot work out is why the old Infortrend box does not have any > > problems with the new file, the newer one has a problem here. Also, when > > I have > > copied over some files (again using the rsync command above) a few files > > did not > > copy correctly (again md5sum) in the first instance but done so later. > > > > I find that highly alarming as that means that at least for larger and/or > > some > > binary files there seems to be a problem. However, I am not sure there to > > look > > at it as I am out of ideas. > > > > Could it be there is a problem with the 'new' controller? > > In all cases I was using ext4 as a file system and I did not have any > > problems > > with that. > > > > Anybody got some sentiments here? > > > > All the best from a sunny London > > > > J?rg > > > > P.S. To make things worse I am off on a work related trip from Monday > > onwards > > and I am working on that problem since Friday evening. > > > > > > > > -- > > ************************************************************* > > Dr. 
J?rg Sa?mannshausen, MRSC > > University College London > > Department of Chemistry > > Gordon Street > > London > > WC1H 0AJ > > > > email: j.sassmannshausen at ucl.ac.uk > > web: http://sassy.formativ.net > > > > Please avoid sending me Word or PowerPoint attachments. > > See http://www.gnu.org/philosophy/no-word-attachments.html -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 230 bytes Desc: This is a digitally signed message part. URL: From andrew.holway at gmail.com Sun Sep 21 10:59:00 2014 From: andrew.holway at gmail.com (Andrew Holway) Date: Sun, 21 Sep 2014 19:59:00 +0200 Subject: [Beowulf] strange problem with large file moving between server In-Reply-To: <201409211817.47972.j.sassmannshausen@ucl.ac.uk> References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk> <201409211817.47972.j.sassmannshausen@ucl.ac.uk> Message-ID: > > Regarding ZFS: is that available for Linux now? I lost a bit track here. > Yes. http://zfsonlinux.org/ I would say its ready for production now. Intel are about to start supporting it under Lustre in the next couple of months and they are typically careful about such things. Cheers, Andrew > > All the best from London > > J?rg > > On Sonntag 21 September 2014 you wrote: > > Hi J?rg, > > > > Sounds like a "typical" but very uncommon silent data corruption problem. > > If you have another copy of the data, compare to that? If you don't have > > another copy, accept the fact that some of your data maybe got silently > > corrupted. > > > > Most RAID controllers do periodic "scrubbing"; was your Infortrend doing > > that? > > > > For the new system, consider using ZFS pointed at plain disks, as it may > > have more layers of checksums compared to your current system. > > > > Regards, > > Alex > > > > On Sunday, September 21, 2014, J?rg Sa?mannshausen < > > > > j.sassmannshausen at ucl.ac.uk> wrote: > > > Dear all, > > > > > > I got a rather strange problem with one of my file servers which I > > > recently have upgraded in order to accommodate more disc space. > > > > > > The problem: I have copies the files from the old file space to a > > > temporary disc > > > storage space using this rsync command: > > > > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo > > > tempspace:baa > > > > > > I am doing this now for some years and never had any problems. > > > > > > As always, I am running md5sum afterwards to be sure ther is not a > > > problem later and the user is loosing data. This time around a rather > > > large file (around 16 GB) the md5sum failed after I moved the files > from > > > the temp space > > > back to the new destination using the same command as above. > > > > > > Having still access to the old file space, I decided to move this file > > > from the > > > old file space. Strangely enough, rsync does not sync the file again > so I > > > had to > > > delete the file. Even after deleting the file and re-sync it from the > old > > > source, the md5sum is wrong. > > > > > > Copying the file to a different file space did not cause these problem, > > > i.e. the > > > md5sum is correct. 
> > > As it is a tar.gz file, I simply decided to decompress the original > file > > > on the > > > different file server. That worked. The file where the md5sum is wrong > > > did not > > > decompress on the different file server but crashed with an error > message > > > when I > > > executed gunzip. So the file is broken. > > > > > > The setup: > > > > > > Originally I was using an old Infortrand box which had old PATA discs > in > > > it. > > > This box is connected via scsi to a frontend server which exports the > > > file space via iscsi. The backend for that, i.e. the one the user is > > > accessing is > > > on a different physical machine and it is a XEN guest. The reason > behind > > > that > > > setting is as the frontend is acting as a backup server and I don't > want > > > people to have access to it. > > > I then exchanged the Infortrend box with a more recent model which got > > > SATA capeabilities but still got scsi connection to the frontend. The > > > frontend is > > > the same. I got a new controller for that box as the old one was > broken. > > > There is no changes in the backend, that is still the same XEN guest on > > > the same hardware. > > > > > > What I cannot work out is why the old Infortrend box does not have any > > > problems with the new file, the newer one has a problem here. Also, > when > > > I have > > > copied over some files (again using the rsync command above) a few > files > > > did not > > > copy correctly (again md5sum) in the first instance but done so later. > > > > > > I find that highly alarming as that means that at least for larger > and/or > > > some > > > binary files there seems to be a problem. However, I am not sure there > to > > > look > > > at it as I am out of ideas. > > > > > > Could it be there is a problem with the 'new' controller? > > > In all cases I was using ext4 as a file system and I did not have any > > > problems > > > with that. > > > > > > Anybody got some sentiments here? > > > > > > All the best from a sunny London > > > > > > J?rg > > > > > > P.S. To make things worse I am off on a work related trip from Monday > > > onwards > > > and I am working on that problem since Friday evening. > > > > > > > > > > > > -- > > > ************************************************************* > > > Dr. J?rg Sa?mannshausen, MRSC > > > University College London > > > Department of Chemistry > > > Gordon Street > > > London > > > WC1H 0AJ > > > > > > email: j.sassmannshausen at ucl.ac.uk > > > web: http://sassy.formativ.net > > > > > > Please avoid sending me Word or PowerPoint attachments. > > > See http://www.gnu.org/philosophy/no-word-attachments.html > > > -- > ************************************************************* > Dr. J?rg Sa?mannshausen, MRSC > University College London > Department of Chemistry > Gordon Street > London > WC1H 0AJ > > email: j.sassmannshausen at ucl.ac.uk > web: http://sassy.formativ.net > > Please avoid sending me Word or PowerPoint attachments. > See http://www.gnu.org/philosophy/no-word-attachments.html > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From j.sassmannshausen at ucl.ac.uk Sun Sep 21 11:08:34 2014 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Sun, 21 Sep 2014 19:08:34 +0100 Subject: [Beowulf] strange problem with large file moving between server In-Reply-To: References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk> <201409211817.47972.j.sassmannshausen@ucl.ac.uk> Message-ID: <201409211908.35098.j.sassmannshausen@ucl.ac.uk> Hi Andrew, thanks. I will look into that. It is good to hear it is ready for production now. The last time I looked into it it was not. All the best J?rg On Sonntag 21 September 2014 you wrote: > > Regarding ZFS: is that available for Linux now? I lost a bit track here. > > Yes. > > http://zfsonlinux.org/ > > I would say its ready for production now. Intel are about to start > supporting it under Lustre in the next couple of months and they are > typically careful about such things. > > Cheers, > > Andrew > > > All the best from London > > > > J?rg > > > > On Sonntag 21 September 2014 you wrote: > > > Hi J?rg, > > > > > > Sounds like a "typical" but very uncommon silent data corruption > > > problem. If you have another copy of the data, compare to that? If you > > > don't have another copy, accept the fact that some of your data maybe > > > got silently corrupted. > > > > > > Most RAID controllers do periodic "scrubbing"; was your Infortrend > > > doing that? > > > > > > For the new system, consider using ZFS pointed at plain disks, as it > > > may have more layers of checksums compared to your current system. > > > > > > Regards, > > > Alex > > > > > > On Sunday, September 21, 2014, J?rg Sa?mannshausen < > > > > > > j.sassmannshausen at ucl.ac.uk> wrote: > > > > Dear all, > > > > > > > > I got a rather strange problem with one of my file servers which I > > > > recently have upgraded in order to accommodate more disc space. > > > > > > > > The problem: I have copies the files from the old file space to a > > > > temporary disc > > > > storage space using this rsync command: > > > > > > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo > > > > tempspace:baa > > > > > > > > I am doing this now for some years and never had any problems. > > > > > > > > As always, I am running md5sum afterwards to be sure ther is not a > > > > problem later and the user is loosing data. This time around a rather > > > > large file (around 16 GB) the md5sum failed after I moved the files > > > > from > > > > > > the temp space > > > > back to the new destination using the same command as above. > > > > > > > > Having still access to the old file space, I decided to move this > > > > file from the > > > > old file space. Strangely enough, rsync does not sync the file again > > > > so I > > > > > > had to > > > > delete the file. Even after deleting the file and re-sync it from the > > > > old > > > > > > source, the md5sum is wrong. > > > > > > > > Copying the file to a different file space did not cause these > > > > problem, i.e. the > > > > md5sum is correct. > > > > As it is a tar.gz file, I simply decided to decompress the original > > > > file > > > > > > on the > > > > different file server. That worked. The file where the md5sum is > > > > wrong did not > > > > decompress on the different file server but crashed with an error > > > > message > > > > > > when I > > > > executed gunzip. So the file is broken. > > > > > > > > The setup: > > > > > > > > Originally I was using an old Infortrand box which had old PATA discs > > > > in > > > > > > it. 
> > > > This box is connected via scsi to a frontend server which exports the > > > > file space via iscsi. The backend for that, i.e. the one the user is > > > > accessing is > > > > on a different physical machine and it is a XEN guest. The reason > > > > behind > > > > > > that > > > > setting is as the frontend is acting as a backup server and I don't > > > > want > > > > > > people to have access to it. > > > > I then exchanged the Infortrend box with a more recent model which > > > > got SATA capeabilities but still got scsi connection to the > > > > frontend. The frontend is > > > > the same. I got a new controller for that box as the old one was > > > > broken. > > > > > > There is no changes in the backend, that is still the same XEN guest > > > > on the same hardware. > > > > > > > > What I cannot work out is why the old Infortrend box does not have > > > > any problems with the new file, the newer one has a problem here. > > > > Also, > > > > when > > > > > > I have > > > > copied over some files (again using the rsync command above) a few > > > > files > > > > > > did not > > > > copy correctly (again md5sum) in the first instance but done so > > > > later. > > > > > > > > I find that highly alarming as that means that at least for larger > > > > and/or > > > > > > some > > > > binary files there seems to be a problem. However, I am not sure > > > > there > > > > to > > > > > > look > > > > at it as I am out of ideas. > > > > > > > > Could it be there is a problem with the 'new' controller? > > > > In all cases I was using ext4 as a file system and I did not have any > > > > problems > > > > with that. > > > > > > > > Anybody got some sentiments here? > > > > > > > > All the best from a sunny London > > > > > > > > J?rg > > > > > > > > P.S. To make things worse I am off on a work related trip from Monday > > > > onwards > > > > and I am working on that problem since Friday evening. > > > > > > > > > > > > > > > > -- > > > > ************************************************************* > > > > Dr. J?rg Sa?mannshausen, MRSC > > > > University College London > > > > Department of Chemistry > > > > Gordon Street > > > > London > > > > WC1H 0AJ > > > > > > > > email: j.sassmannshausen at ucl.ac.uk > > > > web: http://sassy.formativ.net > > > > > > > > Please avoid sending me Word or PowerPoint attachments. > > > > See http://www.gnu.org/philosophy/no-word-attachments.html > > > > -- > > ************************************************************* > > Dr. J?rg Sa?mannshausen, MRSC > > University College London > > Department of Chemistry > > Gordon Street > > London > > WC1H 0AJ > > > > email: j.sassmannshausen at ucl.ac.uk > > web: http://sassy.formativ.net > > > > Please avoid sending me Word or PowerPoint attachments. > > See http://www.gnu.org/philosophy/no-word-attachments.html > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: This is a digitally signed message part.
URL:

From stuartb at 4gh.net Mon Sep 22 12:07:45 2014
From: stuartb at 4gh.net (Stuart Barkley)
Date: Mon, 22 Sep 2014 15:07:45 -0400 (EDT)
Subject: [Beowulf] NUMA zone_reclaim_mode considered harmful?
In-Reply-To: <541C7155.2020506@unimelb.edu.au>
References: <541C7155.2020506@unimelb.edu.au>
Message-ID:

Just to add a few more details to Chris' post with some references which
helped us...

We were seeing severe performance issues on our diskless systems with an
application doing mmap reads of large files on GPFS. The I/O pattern was
sequential reads of a large file. The file was 5-10 times the size of RAM
on the nodes.

We tracked this down to 'pgscand/s' in the 'sar -B' output going
outrageous (13M pages scanned per second trying to find pages to free).

Some googling led us to:

Although a fairly different problem, this was just the information we
needed. We found that /proc/sys/vm/zone_reclaim_mode was being set to 1
on our systems despite various documentation indicating that the default
value should be 0.

As Chris noted, the Linux kernel has recently accepted a patch claiming
to set zone_reclaim_mode to 0 (although the diff does not appear to do it
very directly).

It looks like setting zone_reclaim_mode to 0 was proposed at least as
early as 2009. I'm unclear what happened with this patch:

There is something from 2010 called "zone_reclaim_mode is the essence of
all evil":

This was very useful in pointing out the Nehalem processor as being
particularly susceptible and suggesting 'numactl --hardware' to check the
node distance. Distance greater than 20 being the magic number.

Stuart
--
I've never been lost; I was once bewildered for three days, but never lost!
    --  Daniel Boone
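The checks Stuart mentions translate into a couple of one-liners, assuming
numactl and sysstat (for sar) are installed on the nodes in question:

    # Node distances: a remote distance above ~20 is the range where older
    # kernels auto-enabled zone_reclaim_mode.
    numactl --hardware

    # Watch the pgscand/s column for page-scan storms while the job runs
    # (one sample per second, ten samples):
    sar -B 1 10

    # And the setting itself on the node in question:
    cat /proc/sys/vm/zone_reclaim_mode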