From prentice.bisbal at rutgers.edu Tue Sep 2 08:20:00 2014
From: prentice.bisbal at rutgers.edu (Prentice Bisbal)
Date: Tue, 02 Sep 2014 11:20:00 -0400
Subject: [Beowulf] mpi slow pairs
In-Reply-To: <5403B042.4050100@unimelb.edu.au>
References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk>
 <20140829154957.GA6879@bx9.net> <5403B042.4050100@unimelb.edu.au>
Message-ID: <5405E020.7020303@rutgers.edu>

On 08/31/2014 07:31 PM, Christopher Samuel wrote:
> On 30/08/14 01:49, Greg Lindahl wrote:
>
>> Huh, Intel (PathScale/QLogic) has shipped a NxN debugging program for
>> more than a decade. The first vendor I recall shipping such a program
>> was Microway. I guess it takes a while for good practices to spread
>> throughout our community!
> "The first rule of Infiniband debugging is nobody talks about Infiniband
> debugging".
>
> Got a link for it please?
>

So true. I'd like to see a link, too.

--
Prentice

From prentice.bisbal at rutgers.edu Tue Sep 2 08:16:27 2014
From: prentice.bisbal at rutgers.edu (Prentice Bisbal)
Date: Tue, 02 Sep 2014 11:16:27 -0400
Subject: [Beowulf] mpi slow pairs
In-Reply-To:
References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk>
Message-ID: <5405DF4B.8020708@rutgers.edu>

On 08/29/2014 11:30 AM, Michael Di Domenico wrote:
> On Fri, Aug 29, 2014 at 9:32 AM, John Hearns wrote:
>> I would say the usual tool for that pair-wise comparison is Intel IMB
>> https://software.intel.com/en-us/articles/intel-mpi-benchmarks
>> I hope I have got your requirement correct!
> John,
>
> Close, but not exact. IMB will test ranks, but will not tell me if a
> specific pair of ranks is slower than others, only the collective of
> the ranks under test. what i'm looking for is an mpi version of this
>
> for x in node1->node100
>   for y in node1->node100
>     if x==y then skip
>     else mpirun -n 2 -npernode 1 -host $x,$y bwtest > $x$y.log
>
> unfortunately, the mpirun task takes about 3secs per iteration, and
> with 10k iterations, it's going to take a long time and i'm being
> impatient. i've been trying to write the mpi code myself, but my mpi
> is a little rusty so it's slow going...
>
>> Also have you run ibdiagnet to see if anything is flagged up?
> i've run a multitude of ib diags on the machines, but nothing is
> popping out as wrong. what's weird is that it's only certain pairings
> of machines, not any one machine in general.
>

I find most of the ibdiag* utilities to be of limited value when
debugging IB issues. Unfortunately, Mellanox's Unified Fabric Manager
(UFM) seems to be the only tool that's helpful for accurately monitoring
and identifying issues with IB networks. I've never used UFM myself, but
my friends at Princeton gave me a demo, and it seems like a fantastic
tool. Unfortunately, it's a commercial product, and probably only works
on Mellanox hardware (you don't mention whether you're using QLogic or
Mellanox hardware). The good news is, you can download it and evaluate
it. I'd give that a try, if I were you.

http://www.mellanox.com/page/products_dyn?product_family=100

--
Prentice
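A note on the pair-wise sweep discussed above: the N x N loop Michael sketches
can be batched into rounds of disjoint pairs, so the whole sweep needs only
N-1 rounds of concurrent mpirun launches instead of N*(N-1) serial runs. Below
is a minimal bash sketch using the standard round-robin (circle-method)
pairing; it assumes an even node count, hostnames node1..node100, and the same
hypothetical two-rank "bwtest" benchmark named in the loop above, so adjust to
your site.

    #!/bin/bash
    # Round-robin pairwise bandwidth sweep (sketch, not a drop-in tool).
    # Assumes: an even number of nodes, hostnames node1..node100, and a
    # two-rank point-to-point benchmark called "bwtest" on every node.
    nodes=( $(seq -f 'node%g' 1 100) )
    n=${#nodes[@]}                         # must be even for this pairing

    for (( r = 0; r < n - 1; r++ )); do
        # Circle method: the last node sits still, the others rotate, so
        # every round is a perfect matching and every pair occurs exactly once.
        pairs=( "${nodes[n-1]},${nodes[r]}" )
        for (( k = 1; k <= (n - 2) / 2; k++ )); do
            a=$(( (r + k) % (n - 1) ))
            b=$(( (r - k + n - 1) % (n - 1) ))
            pairs+=( "${nodes[a]},${nodes[b]}" )
        done

        # All pairs in a round are disjoint, so they can run concurrently
        # without sharing a node (and hence without skewing each other).
        for p in "${pairs[@]}"; do
            mpirun -n 2 -npernode 1 -host "$p" bwtest > "${p/,/_}.log" &
        done
        wait
    done

Sorting the resulting logs by measured bandwidth should make the slow pairings
stand out; the serial loop above remains handy for re-checking individual
suspects.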
From samuel at unimelb.edu.au Mon Sep 15 20:34:01 2014
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 16 Sep 2014 13:34:01 +1000
Subject: [Beowulf] Anyone else going to Slurm User Group in Lugano?
Message-ID: <5417AFA9.9040902@unimelb.edu.au>

Hi folks,

Anyone else going to the Slurm User Group next week in Lugano in
Switzerland?

cheers,
Chris
--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci

From samuel at unimelb.edu.au Fri Sep 19 11:09:25 2014
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Sat, 20 Sep 2014 04:09:25 +1000
Subject: [Beowulf] NUMA zone_reclaim_mode considered harmful?
Message-ID: <541C7155.2020506@unimelb.edu.au>

Hi folks,

Over on the xCAT mailing list I've been involved in a thread relating to
diverse settings of zone_reclaim_mode across nodes in clusters.

It starts here with Stuart's good description of the problem and diagnosis:

http://sourceforge.net/p/xcat/mailman/message/32841877/

I did some poking around on our systems and was able to confirm that
whilst our newer iDataplex (dx360 M4s) with Sandy Bridge CPUs all had
zone_reclaim_mode set to 0, our older iDataplex with Nehalems (dx360 M2s)
were all 1, along with an older SGI UV10 (Westmere).

The clincher was the fact that on that same cluster we had some IBM x3690
X5 with MAX5s which boot an identical diskless image, and they had
zone_reclaim_mode set to 0, not 1.

Turns out that this is indeed autotuned by older kernels, with this text
in the kernel Documentation/sysctl/vm.txt:

# zone_reclaim_mode is set during bootup to 1 if it is determined
# that pages from remote zones will cause a measurable performance
# reduction. The page allocator will then reclaim easily reusable
# pages (those page cache pages that are currently not used) before
# allocating off node pages.

However, in 3.16 a patch was committed that disabled this auto-tuning,
turning off zone reclamation by default.

It's probably worth checking your own x86-64 systems to see if this is
set for you and benchmarking with it disabled if it is.

Here's that patch with the description:

commit 4f9b16a64753d0bb607454347036dc997fd03b82
Author: Mel Gorman
Date:   Wed Jun 4 16:07:14 2014 -0700

    mm: disable zone_reclaim_mode by default

    When it was introduced, zone_reclaim_mode made sense as NUMA distances
    punished and workloads were generally partitioned to fit into a NUMA
    node. NUMA machines are now common but few of the workloads are
    NUMA-aware and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled but relatively few can identify the
    problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately so lets have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from
    local node instead of spilling over to other nodes. This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes. The NUMA penalties were
    sufficiently high to justify reclaiming the memory. On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this. Favour the
    common case and disable it by default. Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci
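For anyone who wants to check their own nodes as Chris suggests, a quick
sketch using the standard sysctl/procfs interfaces (run as root, and treat the
persistent setting as something to benchmark before committing to it):

    # Current value: 1 (or higher) means the kernel reclaims local pages
    # before it will allocate from a remote NUMA node.
    cat /proc/sys/vm/zone_reclaim_mode

    # Turn it off for a benchmarking run (takes effect immediately):
    sysctl -w vm.zone_reclaim_mode=0

    # Make it persistent across reboots if the benchmarks agree:
    echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf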
From j.sassmannshausen at ucl.ac.uk Sun Sep 21 07:22:21 2014
From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=)
Date: Sun, 21 Sep 2014 15:22:21 +0100
Subject: [Beowulf] strange problem with large file moving between server
Message-ID: <201409211522.25727.j.sassmannshausen@ucl.ac.uk>

Dear all,

I got a rather strange problem with one of my file servers which I have
recently upgraded in order to accommodate more disc space.

The problem: I have copied the files from the old file space to a
temporary disc storage space using this rsync command:

rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa

I have been doing this for some years now and never had any problems.

As always, I am running md5sum afterwards to be sure there is not a
problem later and the user is losing data. This time, for a rather large
file (around 16 GB), the md5sum failed after I moved the files from the
temp space back to the new destination using the same command as above.

As I still have access to the old file space, I decided to move this file
from the old file space again. Strangely enough, rsync does not sync the
file again, so I had to delete the file. Even after deleting the file and
re-syncing it from the old source, the md5sum is wrong.

Copying the file to a different file space did not cause this problem,
i.e. the md5sum is correct.
As it is a tar.gz file, I simply decided to decompress the original file
on the different file server. That worked. The file where the md5sum is
wrong did not decompress on the different file server but crashed with an
error message when I executed gunzip. So the file is broken.

The setup:

Originally I was using an old Infortrend box which had old PATA discs in
it. This box is connected via SCSI to a frontend server which exports the
file space via iSCSI. The backend for that, i.e. the one the user is
accessing, is on a different physical machine and it is a Xen guest. The
reason behind that setup is that the frontend is acting as a backup
server and I don't want people to have access to it.
I then exchanged the Infortrend box with a more recent model which has
SATA capabilities but still has a SCSI connection to the frontend. The
frontend is the same. I got a new controller for that box as the old one
was broken. There are no changes in the backend, that is still the same
Xen guest on the same hardware.

What I cannot work out is why the old Infortrend box does not have any
problems with the new file while the newer one has a problem here. Also,
when I copied over some files (again using the rsync command above), a
few files did not copy correctly (again md5sum) in the first instance but
did so later.

I find that highly alarming as that means that at least for larger and/or
some binary files there seems to be a problem. However, I am not sure
where to look as I am out of ideas.

Could it be there is a problem with the 'new' controller?
In all cases I was using ext4 as a file system and I did not have any
problems with that.

Anybody got some sentiments here?

All the best from a sunny London

Jörg

P.S. To make things worse I am off on a work-related trip from Monday
onwards and I have been working on that problem since Friday evening.

--
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ

email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: This is a digitally signed message part.
URL:
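One detail worth noting on the rsync behaviour described above: by default
rsync skips files whose size and modification time already match, which is why
a silently corrupted copy is not re-transferred on the next run. Forcing
checksum comparison, or verifying with md5sum lists, works around that. A
short sketch; the paths below are placeholders, and the rsync flags are the
ones from the command above with --checksum added:

    # Re-run the copy with --checksum so rsync compares file contents
    # rather than just size and modification time:
    rsync -vrltH -pgo --stats -D --numeric-ids -x --checksum oldserver:foo tempspace:baa

    # Or verify end to end: generate md5sums on the source tree and check
    # them on the destination (any line not ending in ': OK' is a bad file).
    ( cd /path/to/foo && find . -type f -exec md5sum {} + > /tmp/foo.md5 )
    ( cd /path/to/baa && md5sum -c /tmp/foo.md5 | grep -v ': OK$' )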
From alex.chekholko at gmail.com Sun Sep 21 10:00:20 2014
From: alex.chekholko at gmail.com (Alex Chekholko)
Date: Sun, 21 Sep 2014 10:00:20 -0700
Subject: [Beowulf] strange problem with large file moving between server
In-Reply-To: <201409211522.25727.j.sassmannshausen@ucl.ac.uk>
References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk>
Message-ID:

Hi Jörg,

Sounds like a "typical" but very uncommon silent data corruption problem.
If you have another copy of the data, compare to that? If you don't have
another copy, accept the fact that some of your data may have been
silently corrupted.

Most RAID controllers do periodic "scrubbing"; was your Infortrend doing
that?

For the new system, consider using ZFS pointed at plain disks, as it may
have more layers of checksums compared to your current system.

Regards,
Alex

On Sunday, September 21, 2014, Jörg Saßmannshausen <
j.sassmannshausen at ucl.ac.uk> wrote:

> Dear all,
>
> I got a rather strange problem with one of my file servers which I recently
> have upgraded in order to accommodate more disc space.
>
> The problem: I have copies the files from the old file space to a
> temporary disc
> storage space using this rsync command:
>
> rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
>
> I am doing this now for some years and never had any problems.
>
> As always, I am running md5sum afterwards to be sure ther is not a problem
> later and the user is loosing data. This time around a rather large file
> (around 16 GB) the md5sum failed after I moved the files from the temp
> space
> back to the new destination using the same command as above.
>
> Having still access to the old file space, I decided to move this file
> from the
> old file space. Strangely enough, rsync does not sync the file again so I
> had to
> delete the file. Even after deleting the file and re-sync it from the old
> source, the md5sum is wrong.
>
> Copying the file to a different file space did not cause these problem,
> i.e. the
> md5sum is correct.
> As it is a tar.gz file, I simply decided to decompress the original file
> on the
> different file server. That worked. The file where the md5sum is wrong did
> not
> decompress on the different file server but crashed with an error message
> when I
> executed gunzip. So the file is broken.
>
> The setup:
>
> Originally I was using an old Infortrand box which had old PATA discs in
> it.
> This box is connected via scsi to a frontend server which exports the file
> space via iscsi. The backend for that, i.e. the one the user is accessing
> is
> on a different physical machine and it is a XEN guest. The reason behind
> that
> setting is as the frontend is acting as a backup server and I don't want
> people to have access to it.
> I then exchanged the Infortrend box with a more recent model which got SATA > capeabilities but still got scsi connection to the frontend. The frontend > is > the same. I got a new controller for that box as the old one was broken. > There is no changes in the backend, that is still the same XEN guest on the > same hardware. > > What I cannot work out is why the old Infortrend box does not have any > problems with the new file, the newer one has a problem here. Also, when I > have > copied over some files (again using the rsync command above) a few files > did not > copy correctly (again md5sum) in the first instance but done so later. > > I find that highly alarming as that means that at least for larger and/or > some > binary files there seems to be a problem. However, I am not sure there to > look > at it as I am out of ideas. > > Could it be there is a problem with the 'new' controller? > In all cases I was using ext4 as a file system and I did not have any > problems > with that. > > Anybody got some sentiments here? > > All the best from a sunny London > > J?rg > > P.S. To make things worse I am off on a work related trip from Monday > onwards > and I am working on that problem since Friday evening. > > > > -- > ************************************************************* > Dr. J?rg Sa?mannshausen, MRSC > University College London > Department of Chemistry > Gordon Street > London > WC1H 0AJ > > email: j.sassmannshausen at ucl.ac.uk > web: http://sassy.formativ.net > > Please avoid sending me Word or PowerPoint attachments. > See http://www.gnu.org/philosophy/no-word-attachments.html > -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.sassmannshausen at ucl.ac.uk Sun Sep 21 10:17:43 2014 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Sun, 21 Sep 2014 18:17:43 +0100 Subject: [Beowulf] strange problem with large file moving between server In-Reply-To: References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk> Message-ID: <201409211817.47972.j.sassmannshausen@ucl.ac.uk> Hi Alex, thanks for the feedback. I still got the original data so that is not a problem right now. What worries me is even if I restore the data right now can I trust the system? It is a RAID5 I am using and the discs are new. I have formated the disc space on Thursday so the file system is new as wll. What I found on the front end is that in syslog: mptbase: ioc0: LogInfo(0x11080000): F/W: Outbound DMA Overrun And I get that a few times. So either the controller on the front end got a problem which I did not see with the older Infortrend box as it is slower and hence the controller is less active, or the controller at the Infortrend box got a problem. I don't know whether the Infortrend box does scrubbing. I have not activated something here and I am just using the standart settings. Regarding ZFS: is that available for Linux now? I lost a bit track here. All the best from London J?rg On Sonntag 21 September 2014 you wrote: > Hi J?rg, > > Sounds like a "typical" but very uncommon silent data corruption problem. > If you have another copy of the data, compare to that? If you don't have > another copy, accept the fact that some of your data maybe got silently > corrupted. > > Most RAID controllers do periodic "scrubbing"; was your Infortrend doing > that? > > For the new system, consider using ZFS pointed at plain disks, as it may > have more layers of checksums compared to your current system. 
> > Regards, > Alex > > On Sunday, September 21, 2014, J?rg Sa?mannshausen < > > j.sassmannshausen at ucl.ac.uk> wrote: > > Dear all, > > > > I got a rather strange problem with one of my file servers which I > > recently have upgraded in order to accommodate more disc space. > > > > The problem: I have copies the files from the old file space to a > > temporary disc > > storage space using this rsync command: > > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo > > tempspace:baa > > > > I am doing this now for some years and never had any problems. > > > > As always, I am running md5sum afterwards to be sure ther is not a > > problem later and the user is loosing data. This time around a rather > > large file (around 16 GB) the md5sum failed after I moved the files from > > the temp space > > back to the new destination using the same command as above. > > > > Having still access to the old file space, I decided to move this file > > from the > > old file space. Strangely enough, rsync does not sync the file again so I > > had to > > delete the file. Even after deleting the file and re-sync it from the old > > source, the md5sum is wrong. > > > > Copying the file to a different file space did not cause these problem, > > i.e. the > > md5sum is correct. > > As it is a tar.gz file, I simply decided to decompress the original file > > on the > > different file server. That worked. The file where the md5sum is wrong > > did not > > decompress on the different file server but crashed with an error message > > when I > > executed gunzip. So the file is broken. > > > > The setup: > > > > Originally I was using an old Infortrand box which had old PATA discs in > > it. > > This box is connected via scsi to a frontend server which exports the > > file space via iscsi. The backend for that, i.e. the one the user is > > accessing is > > on a different physical machine and it is a XEN guest. The reason behind > > that > > setting is as the frontend is acting as a backup server and I don't want > > people to have access to it. > > I then exchanged the Infortrend box with a more recent model which got > > SATA capeabilities but still got scsi connection to the frontend. The > > frontend is > > the same. I got a new controller for that box as the old one was broken. > > There is no changes in the backend, that is still the same XEN guest on > > the same hardware. > > > > What I cannot work out is why the old Infortrend box does not have any > > problems with the new file, the newer one has a problem here. Also, when > > I have > > copied over some files (again using the rsync command above) a few files > > did not > > copy correctly (again md5sum) in the first instance but done so later. > > > > I find that highly alarming as that means that at least for larger and/or > > some > > binary files there seems to be a problem. However, I am not sure there to > > look > > at it as I am out of ideas. > > > > Could it be there is a problem with the 'new' controller? > > In all cases I was using ext4 as a file system and I did not have any > > problems > > with that. > > > > Anybody got some sentiments here? > > > > All the best from a sunny London > > > > J?rg > > > > P.S. To make things worse I am off on a work related trip from Monday > > onwards > > and I am working on that problem since Friday evening. > > > > > > > > -- > > ************************************************************* > > Dr. 
J?rg Sa?mannshausen, MRSC > > University College London > > Department of Chemistry > > Gordon Street > > London > > WC1H 0AJ > > > > email: j.sassmannshausen at ucl.ac.uk > > web: http://sassy.formativ.net > > > > Please avoid sending me Word or PowerPoint attachments. > > See http://www.gnu.org/philosophy/no-word-attachments.html -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 230 bytes Desc: This is a digitally signed message part. URL: From andrew.holway at gmail.com Sun Sep 21 10:59:00 2014 From: andrew.holway at gmail.com (Andrew Holway) Date: Sun, 21 Sep 2014 19:59:00 +0200 Subject: [Beowulf] strange problem with large file moving between server In-Reply-To: <201409211817.47972.j.sassmannshausen@ucl.ac.uk> References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk> <201409211817.47972.j.sassmannshausen@ucl.ac.uk> Message-ID: > > Regarding ZFS: is that available for Linux now? I lost a bit track here. > Yes. http://zfsonlinux.org/ I would say its ready for production now. Intel are about to start supporting it under Lustre in the next couple of months and they are typically careful about such things. Cheers, Andrew > > All the best from London > > J?rg > > On Sonntag 21 September 2014 you wrote: > > Hi J?rg, > > > > Sounds like a "typical" but very uncommon silent data corruption problem. > > If you have another copy of the data, compare to that? If you don't have > > another copy, accept the fact that some of your data maybe got silently > > corrupted. > > > > Most RAID controllers do periodic "scrubbing"; was your Infortrend doing > > that? > > > > For the new system, consider using ZFS pointed at plain disks, as it may > > have more layers of checksums compared to your current system. > > > > Regards, > > Alex > > > > On Sunday, September 21, 2014, J?rg Sa?mannshausen < > > > > j.sassmannshausen at ucl.ac.uk> wrote: > > > Dear all, > > > > > > I got a rather strange problem with one of my file servers which I > > > recently have upgraded in order to accommodate more disc space. > > > > > > The problem: I have copies the files from the old file space to a > > > temporary disc > > > storage space using this rsync command: > > > > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo > > > tempspace:baa > > > > > > I am doing this now for some years and never had any problems. > > > > > > As always, I am running md5sum afterwards to be sure ther is not a > > > problem later and the user is loosing data. This time around a rather > > > large file (around 16 GB) the md5sum failed after I moved the files > from > > > the temp space > > > back to the new destination using the same command as above. > > > > > > Having still access to the old file space, I decided to move this file > > > from the > > > old file space. Strangely enough, rsync does not sync the file again > so I > > > had to > > > delete the file. Even after deleting the file and re-sync it from the > old > > > source, the md5sum is wrong. > > > > > > Copying the file to a different file space did not cause these problem, > > > i.e. the > > > md5sum is correct. 
> > > As it is a tar.gz file, I simply decided to decompress the original > file > > > on the > > > different file server. That worked. The file where the md5sum is wrong > > > did not > > > decompress on the different file server but crashed with an error > message > > > when I > > > executed gunzip. So the file is broken. > > > > > > The setup: > > > > > > Originally I was using an old Infortrand box which had old PATA discs > in > > > it. > > > This box is connected via scsi to a frontend server which exports the > > > file space via iscsi. The backend for that, i.e. the one the user is > > > accessing is > > > on a different physical machine and it is a XEN guest. The reason > behind > > > that > > > setting is as the frontend is acting as a backup server and I don't > want > > > people to have access to it. > > > I then exchanged the Infortrend box with a more recent model which got > > > SATA capeabilities but still got scsi connection to the frontend. The > > > frontend is > > > the same. I got a new controller for that box as the old one was > broken. > > > There is no changes in the backend, that is still the same XEN guest on > > > the same hardware. > > > > > > What I cannot work out is why the old Infortrend box does not have any > > > problems with the new file, the newer one has a problem here. Also, > when > > > I have > > > copied over some files (again using the rsync command above) a few > files > > > did not > > > copy correctly (again md5sum) in the first instance but done so later. > > > > > > I find that highly alarming as that means that at least for larger > and/or > > > some > > > binary files there seems to be a problem. However, I am not sure there > to > > > look > > > at it as I am out of ideas. > > > > > > Could it be there is a problem with the 'new' controller? > > > In all cases I was using ext4 as a file system and I did not have any > > > problems > > > with that. > > > > > > Anybody got some sentiments here? > > > > > > All the best from a sunny London > > > > > > J?rg > > > > > > P.S. To make things worse I am off on a work related trip from Monday > > > onwards > > > and I am working on that problem since Friday evening. > > > > > > > > > > > > -- > > > ************************************************************* > > > Dr. J?rg Sa?mannshausen, MRSC > > > University College London > > > Department of Chemistry > > > Gordon Street > > > London > > > WC1H 0AJ > > > > > > email: j.sassmannshausen at ucl.ac.uk > > > web: http://sassy.formativ.net > > > > > > Please avoid sending me Word or PowerPoint attachments. > > > See http://www.gnu.org/philosophy/no-word-attachments.html > > > -- > ************************************************************* > Dr. J?rg Sa?mannshausen, MRSC > University College London > Department of Chemistry > Gordon Street > London > WC1H 0AJ > > email: j.sassmannshausen at ucl.ac.uk > web: http://sassy.formativ.net > > Please avoid sending me Word or PowerPoint attachments. > See http://www.gnu.org/philosophy/no-word-attachments.html > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From j.sassmannshausen at ucl.ac.uk Sun Sep 21 11:08:34 2014 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Sun, 21 Sep 2014 19:08:34 +0100 Subject: [Beowulf] strange problem with large file moving between server In-Reply-To: References: <201409211522.25727.j.sassmannshausen@ucl.ac.uk> <201409211817.47972.j.sassmannshausen@ucl.ac.uk> Message-ID: <201409211908.35098.j.sassmannshausen@ucl.ac.uk> Hi Andrew, thanks. I will look into that. It is good to hear it is ready for production now. The last time I looked into it it was not. All the best J?rg On Sonntag 21 September 2014 you wrote: > > Regarding ZFS: is that available for Linux now? I lost a bit track here. > > Yes. > > http://zfsonlinux.org/ > > I would say its ready for production now. Intel are about to start > supporting it under Lustre in the next couple of months and they are > typically careful about such things. > > Cheers, > > Andrew > > > All the best from London > > > > J?rg > > > > On Sonntag 21 September 2014 you wrote: > > > Hi J?rg, > > > > > > Sounds like a "typical" but very uncommon silent data corruption > > > problem. If you have another copy of the data, compare to that? If you > > > don't have another copy, accept the fact that some of your data maybe > > > got silently corrupted. > > > > > > Most RAID controllers do periodic "scrubbing"; was your Infortrend > > > doing that? > > > > > > For the new system, consider using ZFS pointed at plain disks, as it > > > may have more layers of checksums compared to your current system. > > > > > > Regards, > > > Alex > > > > > > On Sunday, September 21, 2014, J?rg Sa?mannshausen < > > > > > > j.sassmannshausen at ucl.ac.uk> wrote: > > > > Dear all, > > > > > > > > I got a rather strange problem with one of my file servers which I > > > > recently have upgraded in order to accommodate more disc space. > > > > > > > > The problem: I have copies the files from the old file space to a > > > > temporary disc > > > > storage space using this rsync command: > > > > > > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo > > > > tempspace:baa > > > > > > > > I am doing this now for some years and never had any problems. > > > > > > > > As always, I am running md5sum afterwards to be sure ther is not a > > > > problem later and the user is loosing data. This time around a rather > > > > large file (around 16 GB) the md5sum failed after I moved the files > > > > from > > > > > > the temp space > > > > back to the new destination using the same command as above. > > > > > > > > Having still access to the old file space, I decided to move this > > > > file from the > > > > old file space. Strangely enough, rsync does not sync the file again > > > > so I > > > > > > had to > > > > delete the file. Even after deleting the file and re-sync it from the > > > > old > > > > > > source, the md5sum is wrong. > > > > > > > > Copying the file to a different file space did not cause these > > > > problem, i.e. the > > > > md5sum is correct. > > > > As it is a tar.gz file, I simply decided to decompress the original > > > > file > > > > > > on the > > > > different file server. That worked. The file where the md5sum is > > > > wrong did not > > > > decompress on the different file server but crashed with an error > > > > message > > > > > > when I > > > > executed gunzip. So the file is broken. > > > > > > > > The setup: > > > > > > > > Originally I was using an old Infortrand box which had old PATA discs > > > > in > > > > > > it. 
> > > > This box is connected via scsi to a frontend server which exports the > > > > file space via iscsi. The backend for that, i.e. the one the user is > > > > accessing is > > > > on a different physical machine and it is a XEN guest. The reason > > > > behind > > > > > > that > > > > setting is as the frontend is acting as a backup server and I don't > > > > want > > > > > > people to have access to it. > > > > I then exchanged the Infortrend box with a more recent model which > > > > got SATA capeabilities but still got scsi connection to the > > > > frontend. The frontend is > > > > the same. I got a new controller for that box as the old one was > > > > broken. > > > > > > There is no changes in the backend, that is still the same XEN guest > > > > on the same hardware. > > > > > > > > What I cannot work out is why the old Infortrend box does not have > > > > any problems with the new file, the newer one has a problem here. > > > > Also, > > > > when > > > > > > I have > > > > copied over some files (again using the rsync command above) a few > > > > files > > > > > > did not > > > > copy correctly (again md5sum) in the first instance but done so > > > > later. > > > > > > > > I find that highly alarming as that means that at least for larger > > > > and/or > > > > > > some > > > > binary files there seems to be a problem. However, I am not sure > > > > there > > > > to > > > > > > look > > > > at it as I am out of ideas. > > > > > > > > Could it be there is a problem with the 'new' controller? > > > > In all cases I was using ext4 as a file system and I did not have any > > > > problems > > > > with that. > > > > > > > > Anybody got some sentiments here? > > > > > > > > All the best from a sunny London > > > > > > > > J?rg > > > > > > > > P.S. To make things worse I am off on a work related trip from Monday > > > > onwards > > > > and I am working on that problem since Friday evening. > > > > > > > > > > > > > > > > -- > > > > ************************************************************* > > > > Dr. J?rg Sa?mannshausen, MRSC > > > > University College London > > > > Department of Chemistry > > > > Gordon Street > > > > London > > > > WC1H 0AJ > > > > > > > > email: j.sassmannshausen at ucl.ac.uk > > > > web: http://sassy.formativ.net > > > > > > > > Please avoid sending me Word or PowerPoint attachments. > > > > See http://www.gnu.org/philosophy/no-word-attachments.html > > > > -- > > ************************************************************* > > Dr. J?rg Sa?mannshausen, MRSC > > University College London > > Department of Chemistry > > Gordon Street > > London > > WC1H 0AJ > > > > email: j.sassmannshausen at ucl.ac.uk > > web: http://sassy.formativ.net > > > > Please avoid sending me Word or PowerPoint attachments. > > See http://www.gnu.org/philosophy/no-word-attachments.html > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: This is a digitally signed message part.
URL:

From stuartb at 4gh.net Mon Sep 22 12:07:45 2014
From: stuartb at 4gh.net (Stuart Barkley)
Date: Mon, 22 Sep 2014 15:07:45 -0400 (EDT)
Subject: [Beowulf] NUMA zone_reclaim_mode considered harmful?
In-Reply-To: <541C7155.2020506@unimelb.edu.au>
References: <541C7155.2020506@unimelb.edu.au>
Message-ID:

Just to add a few more details to Chris' post with some references which
helped us...

We were seeing severe performance issues on our diskless systems with an
application doing mmap reads of large files on GPFS. The I/O pattern was
sequential reads of a large file. The file was 5-10 times the size of RAM
on the nodes.

We tracked this down to 'pgscand/s' in the 'sar -B' output going
outrageous (13M pages scanned per second trying to find pages to free).

Some googling led us to:

Although a fairly different problem, this was just the information we
needed. We found that /proc/sys/vm/zone_reclaim_mode was being set to 1
on our systems despite various documentation indicating that the default
value should be 0.

As Chris noted, the Linux kernel has recently accepted a patch claiming
to set zone_reclaim_mode to 0 (although the diff does not appear to do it
very directly).

It looks like setting zone_reclaim_mode to 0 was proposed at least as
early as 2009. I'm unclear what happened with this patch:

There is something from 2010 called "zone_reclaim_mode is the essence of
all evil":

This was very useful in pointing out the Nehalem processor as being
particularly susceptible and suggesting 'numactl --hardware' to check the
node distance. Distance greater than 20 being the magic number.

Stuart
--
I've never been lost; I was once bewildered for three days, but never lost!
    --  Daniel Boone
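The checks Stuart mentions translate into a couple of one-liners, assuming
numactl and sysstat (for sar) are installed on the nodes in question:

    # Node distances: a remote distance above ~20 is the range where older
    # kernels auto-enabled zone_reclaim_mode.
    numactl --hardware

    # Watch the pgscand/s column for page-scan storms while the job runs
    # (one sample per second, ten samples):
    sar -B 1 10

    # And the setting itself on the node in question:
    cat /proc/sys/vm/zone_reclaim_mode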