From lindahl at pbm.com Sat Dec 1 15:15:31 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> Message-ID: <20071201231531.GA4736@bx9.net> On Thu, Nov 29, 2007 at 11:26:45AM -0800, Tom Elken wrote: > The SPEC HPG (High Performance Group) is having discussions about using > a hybrid of MPI and thread-level parallelism on the SPEC MPI2007 > benchmark suite. I'd find it useful to debunk the notion that hybrid programming actually gives a speedup. That's probably not what HPG has in mind, but it'd be useful to the community. -- greg From quantummechanicsllc at msn.com Sat Dec 1 13:25:09 2007 From: quantummechanicsllc at msn.com (Donald Shillady) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Really efficient MPIs?? In-Reply-To: References: <428810f20711272131r5cc3bb08w2431083b9cd85b97@mail.gmail.com> <474D6779.5010000@charter.net> Message-ID: Please pardon my naive questions but this is surely the place to get an expert answer. I am enthused by the recent microWulf built by Prof. Adams and his student at Calvin College. That device approached a "homogeneous" parallel system with all the same core frequencies and achieved over 26 GFLOPS for about $1200. With private funds and a curious 3-year-old grandson I prefer to enclose the "System" in four PC cases and an external Ethernet switch. I also want to maintain the performance I now have with a 3 GHz Toshiba Pentium 4 laptop so I prefer an "asymmetric" system with a fast Master node with a lot of frills and three slower dual core satellite PCs. About ten years ago I was able to link an HP 9000/720 running HP-UX in one building with three other SGI nodes running IRIX in another building connected by TCP/IP Ethernet. 
Sadly I do not recall the name of that message passing system, though it was something with "Theoretical Chemists ......", and the link was a pretty bad mismatch between the slower HP9000/720 and the faster SGI CPUs at that time. Was that TCP/IP? Anyway I know it is possible to link CPU/cores with different speeds and different memory-bus speeds so my question is whether "Open MPI" can handle this situation? Specifically, suppose I set up: 1. a Master box with an AMD X2 5800+ overclocked to 3.0 GHz with DDR2 800 memory (at least 4GB, maybe 8GB), one 300 GB SATA drive; there would also be other creature comfort frills on the Master box like CD R/W, floppy drive, graphics card etc. 2. three cheaper AMD X2 4000+ (2.1 GHz) and running cheaper DDR2 667 memory; bare bones, no drives just CPU, memory and gigE switch. 3. connected by a Trendware TEG-S80TXE 8-port Gigabit Ethernet switch with associated NIC switches. If all the CPU/nodes/cores were AMD X2 4000+ units this should be similar to the Calvin College microWulf and run at about 27 GFLOPS (LINPACK) due to the slightly faster 2.1 GHz AMD 4000+ CPUs compared to the microWulf AMD 3800+ 2.0 GHz units. I do not seek the ultimate (GFLOP/$) minimum, just an inexpensive system to run GAMESS for molecular calculations and a chance to learn about parallel software late in my career. So, can "Open MPI" handle different CPU/core frequencies and different memory bus frequencies over gigE? I note that the writer of GAMESS (Mike Schmidt) recommends TCP/IP for GAMESS rather than OPEN MPI and GAMESS is the overwhelming goal for my use but using UBUNTU I would like to be able to access the Internet as well from the Master box. While I have your attention, could you comment on whether Open MPI will run under LINSPIRE? I have messed around with LINSPIRE more than UBUNTU (although I have both source disks) and I like LINSPIRE because it looks more like WINDOWS. Summary: 1. Can Open MPI handle different clock speeds across several node/cores? 
2. Can Open MPI handle different memory bus clock speeds across several node/cores? 3. Why not LINSPIRE instead of UBUNTU? Sorry about the dumb questions but I seem to recall that the Duke Beowulf managed to run using many different X86 PCs so what I want to do should be possible, but is Open MPI the best choice or what else? Don Shillady Emeritus Professor of Chemistry, VCU Ashland VA (working at home) Date: Wed, 28 Nov 2007 10:37:45 -0500 From: peter.st.john@gmail.com To: charliep@cs.earlham.edu Subject: Re: [Beowulf] Really efficient MPIs?? CC: beowulf@beowulf.org For the sake of others as easily confused as myself, I note (now, thanks!) that OpenMP and OpenMPI are two different things: OpenMP (an alternative to the MPI method) is http://en.wikipedia.org/wiki/OpenMP OpenMPI (an implementation of MPI) is http://en.wikipedia.org/wiki/OpenMPI Cool. Peter On Nov 28, 2007 8:49 AM, Charlie Peck wrote: On Nov 28, 2007, at 8:04 AM, Jeffrey B. Layton wrote: > If you don't want to pay money for an MPI, then go with Open-MPI. > It too can run on various networks without recompiling. Plus it's > open-source. Unless you are using gigabit ethernet; Open-MPI is noticeably less efficient than LAM-MPI over that fabric. I suspect at some point in the future gige will catch up but for now my (limited) understanding is that the Open-MPI folks are focusing their time on higher bandwidth/lower latency fabrics than gige. charlie _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Sat Dec 1 17:54:21 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Really efficient MPIs?? 
In-Reply-To: References: <428810f20711272131r5cc3bb08w2431083b9cd85b97@mail.gmail.com> <474D6779.5010000@charter.net> Message-ID: >with "Theoretical Chemists ......". Was that TCP/IP? Anyway I know it is >possible to link CPU/cores with different speeds and different memory-bus >speeds so my question is whether "Open MPI" can handle this situation? sure. nothing about MPI assumes that nodes are homogenous in speed, just that they can somehow get packets from sender to receiver. >Specifically, suppose I set up: the cluster you describe is basically a normal beowulf. > 1. Can Open MPI handle different clock speeds across several node/cores? of course. > 2. Can Open MPI handle different memory bus clock speeds across several node/cores? of course. MPI itself doesn't know or care about what cpu/mem are in the nodes, though individual applications may work best with homogenous nodes. (consider if the app has a periodic global collective operation such as broadcast or reduce. the work done between these collectives should ideally take the same amount of elapsed/wallclock time, or else some nodes will wind up waiting for the slower nodes.) > 3. Why not LINSPIRE instead of UBUNTU? it doesn't matter. an MPI application just wants basic OS functionality like a network stack, process scheduler, memory manager. distros differ only in desktop and config features - they all use pretty much the same kernel, so from MPI's perspective are nearly equivalent. > possible, but is Open MPI the best choice or what else? the MPI implementation won't have any effect on how well your application tolerates heterogeneity of nodes. regards, mark hahn. From gdjacobs at gmail.com Sat Dec 1 19:23:04 2007 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Really efficient MPIs?? 
In-Reply-To: References: <428810f20711272131r5cc3bb08w2431083b9cd85b97@mail.gmail.com> <474D6779.5010000@charter.net> Message-ID: <47522518.8010206@gmail.com> Mark Hahn wrote: >> possible, but is Open MPI the best choice or what else? > > the MPI implementation won't have any effect on how well your application > tolerates heterogeneity of nodes. True within this context of a single binary executable image. Of course, few run totally heterogeneous nodes anymore. What is the new hybrid K10-Cell computer going to be using for interconnect? -- Geoffrey D. Jacobs From lindahl at pbm.com Sun Dec 2 15:17:00 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <320e992a0712020205r32f5eb4fld4d9272b1b958cc8@mail.gmail.com> References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <20071201231531.GA4736@bx9.net> <320e992a0712020205r32f5eb4fld4d9272b1b958cc8@mail.gmail.com> Message-ID: <20071202231700.GA22575@bx9.net> On Sun, Dec 02, 2007 at 12:05:50PM +0200, Eray Ozkural wrote: > I wouldn't be so sure! > > Sounds like a great match for clusters of multi-core architectures. People said the same thing when SMP became common on the low end. > And obviously many papers have been written about programming clusters > of SMP's so what exactly is your point here? The hybrid MPI/OpenMP emperor has no clothes. -- greg From nelsoneci at gmail.com Sat Dec 1 19:38:34 2007 From: nelsoneci at gmail.com (Nelson Castillo) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Recommended paper for parallel sorting? Message-ID: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> Hi. Could you please recommend a paper for reading? I'd like to know about parallel sorting algorithms for this architecture. 
Regards, Nelson.- -- http://arhuaco.org From examachine at gmail.com Sun Dec 2 02:05:50 2007 From: examachine at gmail.com (Eray Ozkural) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <20071201231531.GA4736@bx9.net> References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <20071201231531.GA4736@bx9.net> Message-ID: <320e992a0712020205r32f5eb4fld4d9272b1b958cc8@mail.gmail.com> On Dec 2, 2007 1:15 AM, Greg Lindahl wrote: > On Thu, Nov 29, 2007 at 11:26:45AM -0800, Tom Elken wrote: > > > The SPEC HPG (High Performance Group) is having discussions about using > > a hybrid of MPI and thread-level parallelism on the SPEC MPI2007 > > benchmark suite. > > I'd find it useful to debunk the notion that hybrid programming > actually gives a speedup. That's probably not what HPG has in mind, > but it'd be useful to the community. I wouldn't be so sure! Sounds like a great match for clusters of multi-core architectures. And obviously many papers have been written about programming clusters of SMP's so what exactly is your point here? Best, -- Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara http://www.cs.bilkent.edu.tr/~erayo Malfunct: http://myspace.com/malfunct ai-philosophy: http://groups.yahoo.com/group/ai-philosophy From toon.knapen at gmail.com Sun Dec 2 06:51:53 2007 From: toon.knapen at gmail.com (Toon Knapen) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <474FEF18.6020308@obs.unige.ch> Message-ID: <4752C689.5030102@gmail.com> Mark Hahn wrote: >> IMHO the hybris approach (MPI+threads) is interesting in case every >> MPI-process has lots of local data. > > yes. but does this happen a lot? 
the appealing case would be threads > that make lots of heavy use of some large data, _but_ > without needing synchronization/locking. once you need locking > among the threads, message passing starts to catch up. Direct solvers (for Finite Elements for instance) need a lot of data. Additionally, distributing the matrix generates interfaces (between the different submatrices) which are hard to solve. In such a situation, one tries to minimize the number of interfaces (by having one submatrix per MPI-process) and speed up the solving of each submatrix using threads. Finance is another example. Financial applications need to evaluate a large number of open positions based on the simulated, current or past market-data. There are many dependencies between all the different data, which makes it hard to decompose the data into largely independent chunks. > >> latter is simpler because it only requires MPI-parallelism but if the >> code >> is memory-bound and every mpi-process has much of the same data, it >> will be >> better to share this common data with all processes on the same cpu >> and thus >> use threads intra-node. > > what kind of applications behave like that? I agree that if your MPI > app is keeping huge amounts of (static) data replicated in each rank, > you should rethink your design. > See above. From Hakon.Bugge at scali.com Mon Dec 3 01:11:51 2007 From: Hakon.Bugge at scali.com (Håkon Bugge) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <200712022000.lB2K08cL014118@bluewest.scyld.com> References: <200712022000.lB2K08cL014118@bluewest.scyld.com> Message-ID: <20071203091158.9BBED35AD18@mail.scali.no> At Sat, 1 Dec 2007 15:15:31, Greg Lindahl wrote: > > The SPEC HPG (High Performance Group) is having discussions about using > > a hybrid of MPI and thread-level parallelism on the SPEC MPI2007 > > benchmark suite. 
> >I'd find it useful to debunk the notion that hybrid programming >actually gives a speedup. That's probably not what HPG has in mind, >but it'd be useful to the community. > >-- greg I have a slightly different view. Hybrid programming is used for performance reasons, but only in cases where parallelization (to the same level) is impossible/impractical using the pure MPI mode, or the parallelization yields low efficiency. So, if you're able to achieve your performance with MPI, you probably will. But there are cases where you cannot; a) the "decomposition parallel efficiency" is not good enough or b) the processes need a huge (shared) table. As to a), in the past I worked with a synthetic aperture radar application where I ended up with the hybrid model. The problem could only be decomposed in one dimension, and each process had 33% overhead. Obviously, the hybrid model was a good choice in this case. As to b), it might be more economical to size the memory on each node to the size of a single table and share it through shared memory. It is of course possible to share it from several MPI processes as well, but implementors might find their reason for using a hybrid model here. Relevance to the SPEC MPI2007? To my knowledge, the applications here do not have any of the constraints above, so I would be severely surprised if anyone uses the hybrid model on them. Håkon From rgb at phy.duke.edu Mon Dec 3 06:09:10 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Recommended paper for parallel sorting? In-Reply-To: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> Message-ID: On Sat, 1 Dec 2007, Nelson Castillo wrote: > Hi. > > Could you please recommend a paper for reading? I'd like to know about parallel > sorting algorithms for this architecture. You might check out Ian Foster's free online book on parallel algorithms. 
It is worth buying if you're going to be doing a lot of parallel programming. Or there are two or three other decent textbooks on parallel programming at the algorithm level. I don't recall offhand if Foster covers sorting, but you can easily find out for free. Remember, GIYF here -- just enter search strings like "Foster Parallel Programming" to find his book, "Parallel Sorting Algorithms" or the like to see if there is anything out there on the web. rgb > > Regards, > Nelson.- > > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From peter.st.john at gmail.com Mon Dec 3 07:27:49 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Recommended paper for parallel sorting? In-Reply-To: References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> Message-ID: (re Ian Foster, *Designing and Building Parallel Programs*, online as below or Addison Wesley): I did that search and right at the top was this link, which looks like homebase for the original material: http://www-unix.mcs.anl.gov/dbpp/ Very cool, thanks RGB for what looks like a toothsome book. Peter On Dec 3, 2007 9:09 AM, Robert G. Brown wrote: > On Sat, 1 Dec 2007, Nelson Castillo wrote: > > > Hi. > > > > Could you please recommend a paper for reading? I'd like to know about > parallel > > sorting algorithms for this architecture. > > You might check out Ian Foster's free online book on parallel > algorithms. It is worth buying if you're going to be doing a lot of > parallel programming. Or there are two or three other decent textbooks > on parallel programming at the algorithm level. I don't recall offhand > if Foster covers sorting, but you can easily find out for free. 
> > Remember, GIYF here -- just enter search strings like "Foster Parallel > Programming" to find his book, "Parallel Sorting Algorithms" or the like > to see if there is anything out there on the web. > > rgb > > > > > Regards, > > Nelson.- > > > > > > -- > Robert G. Brown > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone(cell): 1-919-280-8443 > Web: http://www.phy.duke.edu/~rgb > Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 > From rgb at phy.duke.edu Mon Dec 3 10:38:12 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Recommended paper for parallel sorting? In-Reply-To: References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> Message-ID: On Mon, 3 Dec 2007, Peter St. John wrote: > (re Ian Foster, *Designing and Building Parallel Programs*, online as below > or Addison Wesley): > > I did that search and right at the top was this link, which looks like homebase > for the original material: > http://www-unix.mcs.anl.gov/dbpp/ > Very cool, thanks RGB for what looks like a toothsome book. I went ahead and bought a paper copy, but it is nice to be able to access the material from a workstation because I don't carry the copy around with me all the time...;-) rgb > Peter > > On Dec 3, 2007 9:09 AM, Robert G. Brown wrote: > >> On Sat, 1 Dec 2007, Nelson Castillo wrote: >> >>> Hi. >>> >>> Could you please recommend a paper for reading? I'd like to know about >> parallel >>> sorting algorithms for this architecture. >> >> You might check out Ian Foster's free online book on parallel >> algorithms. 
It is worth buying if you're going to be doing a lot of >> parallel programming. Or there are two or three other decent textbooks >> on parallel programming at the algorithm level. I don't recall offhand >> if Foster covers sorting, but you can easily find out for free. >> >> Remember, GIYF here -- just enter search strings like "Foster Parallel >> Programming" to find his book, "Parallel Sorting Algorithms" or the like >> to see if there is anything out there on the web. >> >> rgb >> >>> >>> Regards, >>> Nelson.- >>> >>> >> >> -- >> Robert G. Brown >> Duke University Dept. of Physics, Box 90305 >> Durham, N.C. 27708-0305 >> Phone(cell): 1-919-280-8443 >> Web: http://www.phy.duke.edu/~rgb >> Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 >> > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From lindahl at pbm.com Mon Dec 3 12:55:45 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <20071203091158.9BBED35AD18@mail.scali.no> References: <200712022000.lB2K08cL014118@bluewest.scyld.com> <20071203091158.9BBED35AD18@mail.scali.no> Message-ID: <20071203205545.GA11220@bx9.net> On Mon, Dec 03, 2007 at 10:11:51AM +0100, Håkon Bugge wrote: > But > there are cases where you cannot; a) the > "decomposition parallel efficiency" is not good > enough or b) the processes need a huge (shared) table. You can accomplish (b) using an mmapped file, which is much easier than hybrid programming. 
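A minimal sketch of the mmap approach to case (b), written in Python for brevity. The file layout (a flat array of little-endian doubles) and the names `write_table`, `open_shared_table` and `lookup` are assumptions made for illustration; nothing in the thread specifies them. The idea is that every MPI rank on a node maps the same file read-only, so on a typical Linux system the mappings are backed by the same page-cache pages rather than by one private copy per rank.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct("<d")  # assumed layout: one little-endian double per entry


def write_table(path, values):
    """Write the lookup table once, as a flat binary file (hypothetical layout)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ENTRY.pack(v))


def open_shared_table(path):
    """Map the table read-only; each process on the node would call this itself."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The mapping stays valid after the
        # file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, index):
    """Read entry `index` straight from the mapping."""
    start = index * ENTRY.size
    return ENTRY.unpack(table[start:start + ENTRY.size])[0]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "table.bin")
    write_table(path, [float(i) * 0.5 for i in range(1000)])
    table = open_shared_table(path)
    print(lookup(table, 10))
    table.close()
    os.remove(path)
    os.rmdir(workdir)
```

In a real MPI code the same pattern would be used from C or Fortran with the `mmap` system call; the sketch only shows that no threading is needed to share a read-only table between processes on one node.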
I agree that (a) is theoretically useful, but I have only once seen a benchmark situation where (a) was the case. I have seen several situations where a hybrid code had a 1D MPI decomposition and "needed" OpenMP for more scaling, but could have been a pure MPI 2D or 3D code, with less complexity than the hybrid code. -- greg From richard.walsh at comcast.net Mon Dec 3 13:47:41 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI Message-ID: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: Håkon Bugge > I have a slightly different view. Hybrid > programming is used for performance reasons, but > only in cases where parallelization (to the same > level) is impossible/impractical using the pure > MPI mode, or the parallelization yields low > efficiency. So, if you're able to achieve your > performance with MPI, you probably will. But > there are cases where you cannot; a) the > "decomposition parallel efficiency" is not good > enough or b) the processes need a huge (shared) table. I think that what is being said here is that applications may be decomposable in some number of dimensions, but not so in all. If the benefits in performance in locally managing the "unruly" dimensions are great enough, then a hybrid program may be worth the trouble. I think that the number of real-world apps in this class is perhaps not large, or there would be more hybrid code. Another perhaps relevant alternative, which will at some point be able to take on both the partitionable and unpartitionable extreme cases and everything in between, is the PGAS language extensions (UPC and CAF). Not yet at distributed-memory performance parity with well-coded MPI, but with, arguably, an intrinsic programmability advantage in LOC and in data structure coverage. 
AMR codes tracking shedding vortices are inherently non-partitionable (or in need of regular repartitioning). Managing them in either MPI or OpenMP in a distributed memory environment is a chore. And if you believe that ... ;-) ... then there is of course the "magic" of many-threaded latency hiding (can't say I am a true believer for the data intensive OZ of HPC). Some would have you believe that a 32 thread, 8 core Niagara 2 (or perhaps a future design at some higher active thread to core ratio) can hide all your data latency events behind its active thread horizon. Maybe the key is to combine PGAS with many-threads ... mmm ... anyone doing this? ;-) rbw -- "Making predictions is hard, especially about the future." Niels Bohr -- Richard Walsh Thrashing River Consulting-- 5605 Alameda St. Shoreview, MN 55126 Phone #: 612-382-4620 From lindahl at pbm.com Mon Dec 3 13:57:53 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net> References: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net> Message-ID: <20071203215752.GB6727@bx9.net> On Mon, Dec 03, 2007 at 09:47:41PM +0000, richard.walsh@comcast.net wrote: > I think that the number of real-world apps in this class is perhaps > not large, or there would be more hybrid code. Ah, but you've missed the random element here: People start writing hybrid code before they have any proof that it helps them. Or they don't write it at all because they know it's complicated. Either way, you can't assume cause and effect of "hybrid helps me" and "my code is hybrid". 
-- greg From gerry.creager at tamu.edu Mon Dec 3 14:18:46 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <20071203215752.GB6727@bx9.net> References: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net> <20071203215752.GB6727@bx9.net> Message-ID: <475480C6.1070309@tamu.edu> Greg Lindahl wrote: > On Mon, Dec 03, 2007 at 09:47:41PM +0000, richard.walsh@comcast.net wrote: > >> I think that the number of real-world apps in this class is perhaps >> not large, or there would be more hybrid code. > > Ah, but you've missed the random element here: People start writing > hybrid code before they have any proof that it helps them. Or they > don't write it at all because they know it's complicated. Either way, > you can't assume cause and effect of "hybrid helps me" and "my code > is hybrid". Or their code turns out to be 'hybrid' because they didn't really know what they were writing... gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From richard.walsh at comcast.net Mon Dec 3 14:29:56 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI Message-ID: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: Greg Lindahl > On Mon, Dec 03, 2007 at 09:47:41PM +0000, richard.walsh@comcast.net wrote: > > > I think that the number of real-world apps in this class is perhaps > > not large, or there would be more hybrid code. 
> > Ah, but you've missed the random element here: People start writing > hybrid code before they have any proof that it helps them. Or they > don't write it at all because they know it's complicated. Either way, > you can't assume cause and effect of "hybrid helps me" and "my code > is hybrid". True enough ... one must consider both the kinetic and thermodynamic requirements for existence, but I was thinking that the system was perhaps at equilibrium by now. Still, it was careless of me to use non-existence to argue for either the absence of cause or presence of impossibility. I am still waiting to get a straight flush in 5-card draw. ;-) rbw From lindahl at pbm.com Mon Dec 3 15:23:44 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net> References: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net> Message-ID: <20071203232343.GA27291@bx9.net> On Mon, Dec 03, 2007 at 10:29:56PM +0000, richard.walsh@comcast.net wrote: > True, enough ... one must consider both the kinetic and > thermodynamic requirements for existence, but I was thinking that the > system was perhaps at equilibrium by now. No, people keep on producing hybrid codes and finding that they aren't any faster than pure MPI. It's an amazing waste of money. > I am still waiting to get a straight flush in 5-card draw. Riiiight. 
-- greg From gmichal at uow.edu.au Mon Dec 3 15:20:26 2007 From: gmichal at uow.edu.au (Guillaume MICHAL) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] A cluster for material simulation Message-ID: Good morning all, Our faculty is thinking about a cluster for material simulations. At the moment we would like to use FEM, MD, MPM and maybe in some cases a multiscale FEM/MD or MPM/MD. We will start with a very small cluster of around 5 nodes to become familiar with this kind of system and then extend it to around 20 nodes. Task size could vary between 1G and, let's say, 10G. FEM will use Abaqus or CODE_ASTER. I don't really know the names of the software for MPM and MD. I did some research and reading (by the way, Building clustered linux systems by Robert W.Lucke is a bit scary!) and defined two kinds of systems. I'd like your opinion on these. Both systems use Gigabit ethernet, 2GB of memory per CPU, and an 80GB SATA hard drive per node. Desktop motherboard based system: 1 asus P5E WS Professional motherboard, 1066FSB, DDR2 800 NON ECC unbuffered, 2GigE ports, 1 Intel Q6600 CPU @2.4GHz 8MB L2 cache Server motherboard based system: Supermicroserver 6015C-MTB 1333/1066FSB, DDR2 667 ECC FB-DIMM, 2GigE ports, 2 intel Xeon 5410 CPU @2.3Ghz 12MB L2 cache It might seem I'm comparing apples and oranges but theoretical peak performance is equivalent and in terms of cost/CPU there is not a huge difference (150 to 250 A$); also the server solution uses half as many nodes, which could be interesting in terms of space, cables, switch... For recycling the desktop option seems better, except if we use the servers for some kind of graphics cluster in the future. Now the real questions: 1- If I understood properly, FEM is kind of memory bound, so DDR2 800/1066FSB/8MB L2 cache or DDR2 667/1333FSB/12MB L2 cache? -> kind of a newbie to these things! 2- Which one seems better in terms of performance, reliability? 
3- Do I need a distinct network for NFS sharing (that's why I wanted 2 GigE ports per node) or can I put the shared data on the master node (Quote from R.W.Lucke book: "This is bad, bad, bad")? 4- there is also the supermicro superserver 6015tw-tb with two dual socket motherboards in a 1U form factor (note it's just two nodes put in one box, no interconnections whatsoever apart from the PSU) with roughly the same price per CPU compared to the other supermicro solution, could be interesting for an even more compact system, do you have any knowledge about this system? 5- anything I didn't think of and might be worth checking such as "Oh! you need a fast hard drive as i/o is critical...;-)" Thank you for your advice! Guillaume Michal -- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/ From Michael.Frese at NumerEx.com Mon Dec 3 16:39:45 2007 From: Michael.Frese at NumerEx.com (Michael H. Frese) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] NFS Read Errors Message-ID: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> We were having trouble restarting from our homegrown parallel magnetohydrodynamic code's checkpoint files. The files could be read, but funny things happened in the run afterward. Eventually we figured out that the restarted parallel run differed from the serial restarted run from the same checkpoint. After much gnashing of teeth and rending of apparel, we found that the checkpoint files were being read incorrectly across NFS. That let us simplify our search for the problem. We first found that the local md5 digest [openssl dgst -md5 (file...)] on an NFS cp'ed version of the file was different from that produced on the original file. What was interesting was that the copy either took forEVER -- like 10 minutes or 20 minutes for a 1 GB file -- when the final result was bad or it took about a minute when the file was perfect. 
I'm guessing that whatever error checking that gets done on the packets was rejecting so many it finally got a bad packet it couldn't tell was bad. When we found that doing the md5 digest on a remote file produced a different result than doing it on the processor on which the disk was mounted, our tests got simpler. And shorter, still, after we found that we could get fairly frequent failures with 10 MB files or smaller. Clearly we had an NFS failure, probably associated with hardware. This was all between two specific nodes of our small cluster. [Old hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual (Tyan...) chip motherboards both running Redhat 9 one with the 2.4.20-8 kernels, though one is the smp version; NetGear GA311 NICs; and a NetGear GS108 8 port Copper 1 GB/s switch. The single processor motherboards have 32-bit PCI slots so their network speeds are limited to 300 kbps as shown by netpipe. All of the LEDs at the ends of the cables show 1000Mb connections.] Then we started checking other pairs. Some were fine. Some were bad in the same way. So we replaced the switch, changing to a 16 port NetGear GS216. That seemed to cure most of the problem. But we continued to have problems copying a file on one particular single processor machine from the others. That's where we are now. The md5 digest run on that machine consistently shows the same result, whereas the digest for that file produced on a remote machine will be almost stochastic. In some cases it will eventually settle in to the right answer, and then the speed goes WAY up. I suppose that happens because the file request can be served from the local machine's cache. But why doesn't it happen after it received bad blocks? Most, if not all of the original network cards in those machines went bad and have been replaced in the last few years, so I decided to try a brand new GA311. No joy there. It still gives out the wrong info. 
I guess the motherboard PCI bus controller is hinky, but I'm far from sure. We are in the process of upgrading and thus replacing all the machines we have of that configuration due to space limitations and their age, but I'm still curious what the problem could be. Suggestions? Comments? Mike From landman at scalableinformatics.com Mon Dec 3 17:21:37 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] NFS Read Errors In-Reply-To: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> Message-ID: <4754ABA1.9030105@scalableinformatics.com> Hi Michael: Michael H. Frese wrote: > We were having trouble restarting from our homegrown parallel > magnetohydrodynamic code's checkpoint files. The files could be read, > but funny things happened in the run afterward. Eventually we figured > out that the restarted parallel run differed from the serial restarted > run from the same checkpoint. > > After much gnashing of teeth and rending of apparel, we found that the > checkpoint files were being read incorrectly across NFS. That let us > simplify our search for the problem. We first found that the local md5 > digest [openssl dgst -md5 (file...)] on an NFS cp'ed version of the file md5sum filename does the same thing with a slightly simpler syntax. There is mounting evidence that you should use sha1sum rather than md5sum. > was different from that produced on the original file. What was > interesting was that the copy either took forEVER -- like 10 minutes or > 20 minutes for a 1 GB file -- when the final result was bad or it took > about a minute when the file was perfect. I'm guessing that whatever > error checking that gets done on the packets was rejecting so many it > finally got a bad packet it couldn't tell was bad. Sounds a great deal like a bad disk/disk system or something mucking with your connection to the data. 
1 GB file, even at 1 MB/s is 1000 seconds, or 16 minutes. If you have a disk which keeps timing out, or has bad blocks, and keeps retrying, well, stuff like this can happen, especially on old kernels (and old hardware). Could also be a RAM error. > > When we found that doing the md5 digest on a remote file produced a > different result than doing it on the processor on which the disk was > mounted, our tests got simpler. And shorter, still, after we found that > we could get fairly frequent failures with 10 MB files or smaller. > Clearly we had an NFS failure, probably associated with hardware. Yes. I would venture a guess that you are seeing *lots* of errors in your /var/log/syslog or /var/log/messages files. > This was all between two specific nodes of our small cluster. [Old > hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual > (Tyan...) chip motherboards both running Redhat 9 one with the 2.4.20-8 > kernels, though one is the smp version; NetGear GA311 NICs; and a Owie... > NetGear GS108 8 port Copper 1 GB/s switch. The single processor > motherboards have 32-bit PCI slots so their network speeds are limited > to 300 kbps as shown by netpipe. All of the LEDs at the ends of the > cables show 1000Mb connections.] 300 kbps? thats 300 kilo bits per second (abbreviations are *very* important to get right, kB/s is not the same as kb/s). 300 kbps is usually read as 300 kilo bits per second. Or about about 37.5 kB/s. Which is about the average speed of various DSL lines. I hope you mean 30 MB/s (or 240 Mb/s). > > Then we started checking other pairs. Some were fine. Some were bad in > the same way. So we replaced the switch, changing to a 16 port NetGear > GS216. That seemed to cure most of the problem. But we continued to We have seen bad switches a few times. > have problems copying a file on one particular single processor machine > from the others. > > That's where we are now. 
The md5 digest run on that machine > consistently shows the same result, whereas the digest for that file > produced on a remote machine will be almost stochastic. In some cases > it will eventually settle in to the right answer, and then the speed > goes WAY up. I suppose that happens because the file request can be > served from the local machine's cache. But why doesn't it happen after > it received bad blocks? I am guessing you are using TCP NFS mounts as well? TCP forces retries in the event of bad packets. UDP doesn't force this, but the NFS protocol will try. Ram errors, bad cables, burnt switches, and machines with interrupt problems (old machines often shared interrupts without being able to do a very good job of it). > Most, if not all of the original network cards in those machines went > bad and have been replaced in the last few years, so I decided to try a > brand new GA311. No joy there. It still gives out the wrong info. I > guess the motherboard PCI bus controller is hinky, but I'm far from sure. Did you try a new cable? Had a few cables go bad, usually they are marginal to begin with. > > We are in the process of upgrading and thus replacing all the machines > we have of that configuration due to space limitations and their age, > but I'm still curious what the problem could be. There are quite a few possibilities unfortunately. Unless you plan to use these existing machines for quite a while longer, it might be less painful to shut off the malfunctioning node. > > Suggestions? Comments? 2.4.20? Athlons? 
I would say a serious hardware/OS refresh is in order :) -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Mon Dec 3 22:15:00 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] Re: CSharifi Next generation of HPC In-Reply-To: References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <474FEF18.6020308@obs.unige.ch> <4752C689.5030102@gmail.com> Message-ID: > C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level OK, how about providing some meaty content? google shows me that you've put this fairly content-light PR on several groups and websites. > as Usability. Although the latter belief was hard to realize, a sample why was it hard? there have been a fair number of several kernel-based dist-OS approaches (MOSIX comes to mind, but scyld, and also a host of older academic systems.) > byproduct called DIPC was built purely based on this thesis and openly > announced to the Linux community worldwide in 1993. This was admired for > being able to provide necessary supports for distributed communication at > the Kernel Level of Linux for the first time in the world, and for providing page-based distributed shared memory has been done many, many times, and operate in a very easy-to-understand manner (like a cache with 4KB rather than 64KB lines.) can you quantify the advantage to managing the DSM in the kernel? I'm sure you're aware that "playing MMU games" is not highly regarded in many circles because of its slowness - have you figured out a way around that? regards, mark hahn. 
From hahn at mcmaster.ca Mon Dec 3 22:31:05 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] NFS Read Errors In-Reply-To: <4754ABA1.9030105@scalableinformatics.com> References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com> Message-ID: > does the same thing with a slightly simpler syntax. There is mounting > evidence that you should use sha1sum rather than md5sum. for general checking, md5 is still fine (ie not security-related stuff). > I am guessing you are using TCP NFS mounts as well? TCP forces retries in > the event of bad packets. UDP doesn't force this, but the NFS protocol will UDP has a checksum as well, though it's only 16b. then again, the TCP checksum isn't all that strong for today's data rates either. you should definitely examine /proc/net/dev on involved machines. >> We are in the process of upgrading and thus replacing all the machines we >> have of that configuration due to space limitations and their age, but I'm >> still curious what the problem could be. I would attempt to reduce the complexity of your testing. for instance, can a node write and verify to its local disk without problem? can it stream data over tcp sockets (netcat or the like) without corruption or obvious problems reflected in /proc/net/dev? does ethtool tell you anything about the config of the nic? comparing tcp vs udp NFS would be sensible as well - varying the packet size, too. switching client and/or server to a modern 2.6 kernel may be instructive. From hahn at mcmaster.ca Mon Dec 3 23:00:19 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat Oct 11 01:06:44 2008 Subject: [Beowulf] A cluster for material simulation In-Reply-To: References: Message-ID: > be familliar with this kind of system and then extend it to around 20 nodes. > Tasks size could vary between 1G to let say 10G. 10G is quite modest, especially for 20 nodes (ram is cheap!). are you sure you need a cluster? 
a single nicely configured SMP system will handle 10G jobs quite neatly, and save considerable effort. of course, you can't really scale memory bandwidth without going to a cluster, but I would guess that a 4-socket, quad-core AMD system with all memory banks active would be tempting. > I did some reasearch and reading (by the way, Building clustered linux > systems by Robert W.Lucke is a bit scary!) well, it tries to cover a lot of ground. it's really pretty simple to get a basic cluster up and running. > Both systems use a Gigabit ethernet, 2GB of memory per CPU, 80GB of sata hard > drive per nodes. > Dekstop motherboard based system: > 1 asus P5E WS Professional motherboard, 1066FSB, DDR2 800 NON ECC unbuffered, > 2GigE ports, 1 Intel Q6600 CPU @2.4GHz 8MB L2 cache > > Server motherboard based system: > Supermicroserver 6015C-MTB 1333/1066FSB, DDR2 667 ECC FB-DIMM, 2GigE ports, 2 > intel Xeon 5410 CPU @2.3Ghz 12MB L2 cache the main thing here is that Intel has, for a long time, had a mediocre reputation for memory bandwidth. I probably would not consider buying anything older than the 45nm penryn-generation chips with 1333 or higher FSB. > It might seems I'm comparing apples and oranges but theoretical peak > performance is equivalent and in term of cost/CPU there is not a huge > difference(150 to 250 A$), also the server solution use twice as less nodes > wich could be interesting in term of space, cables, switch... a 20-node cluster is half a rack, and not really complicated in cabling. how's your cooling? I'd probably worry about cooling before I worried about cabling... > For recycling > the desktop option seems better except if we use the servers for some kind of > graphic cluster in the futur. perhaps. my experience is that well-adapted cluster nodes are not good for desktops precisely because of those adaptations. 
> 1- If I understood properly FEM is kind of memory bounded so DDR2 > 800/1066FSB/8MB L2 cache or DDR2 667/1333FSB/12MB L2 cache -> kind of newbie > to theses things! 10G/20 nodes is 512M/node - divided among 4 cores is 128M/core, so I suspect the cache size isn't going to make much difference. the FSB will matter, though. > 2- Which one seems better in term of performance, reliability? faster FSB and ram will be noticably better in performance. I don't see why there would be much difference in reliability, though. the parts that break are mainly fans. server parts tend to offer nicer monitoring options as well as the comfort of ECC (one less place for a heisenbug to live.) > 3- Do I need a distinct network for NFS sharing (thath's why I wanted certainly not. my experience is that a single job doesn't tend to overlap its MPI and NFS traffic much. if you share a single node among multiple jobs, this could be an issue. > 2 GigE ports per nodes) or I put the shared data on the master node(Quote > from R.W.Luke book: "This is bad, bad, bad")? well, he's wrong. sure, it's a hotspot, but it's also convenient, cheap and effective. going to a parallel filesystem will be a significant increase in complexity, though only you can know how badly you need the IO performance. a shared fileserver can deliver higher bandwidth through trunking or even a 10Gb link. configuring a couple fileservers obviously scales nicely at the expense of having a partitioned namespace. > 4- there is also the supermicro superserver 6015tw-tb with two dual > socket motherboard in a 1U form factor (node it's just two nodes put in one > box, no interconnections whatsoever apart from the PSU) with roughtly the > same price per CPU compare to the other supermicro solution, could be > interesting for an even more compact system, do you have any knowledge about > this system? AFAIK, the only downside is a custom formfactor (chassis, boards, PSU). but why is space such an issue for you? 
a stack of 20 1U servers is not all that big. it's also a newer system design which, given low-volt cpus, would be nicely heat-efficient.

> 5- anything I didn't think of and might be worth checking such as
> "Oh! you need a fast hard drive as i/o is critical...;-)"

your IO will be over gigabit, so you don't need fast HD (current single disks average about 70 MB/s). even for a 20-node cluster, I'd seriously consider getting IPMI or at least controllable power.

From rgb at phy.duke.edu Tue Dec 4 04:53:10 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat Oct 11 01:06:44 2008
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI
In-Reply-To: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
References: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
Message-ID:

On Mon, 3 Dec 2007, richard.walsh@comcast.net wrote:

> impossibility. I am still waiting to get a straight flush in 5-card
> draw.

Are ye, now... interesting. Sometime we'll have to wait together. In the meantime, I find that if you play the game with a wild card or eight it alters the odds magnificently. Why, you can get a straight flush and still lose the game...;-)

rgb

(Who's lurking but busy and who never, ever writes hybrid code. Sounds positively -- um -- sexual. Or radioactive. Involving white coated men with large ears and thick glasses. Not for me.)

--
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

From larry.stewart at sicortex.com Tue Dec 4 05:46:35 2007
From: larry.stewart at sicortex.com (Larry Stewart)
Date: Sat Oct 11 01:06:44 2008
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
Message-ID: <47555A3B.3080609@sicortex.com>

Nelson Castillo wrote:

>Hi.
>
>Could you please recommend a paper for reading? I'd like to know about parallel
>sorting algorithms for this architecture.
>
>Regards,
>Nelson.-

I was looking into this a few months ago. Here are some good papers I found:

http://citeseer.ist.psu.edu/393851.html -- Communications Conscious Radix Sort

http://citeseer.ist.psu.edu/569483.html -- Parallel Algorithms for Personalized Communication and Sorting With an Experimental Study

Martin Schmollinger: Improving Communication Sensitive Parallel Radix Sort for Unbalanced Data. Euro-Par 2003: 885-893

Schmollinger's PhD dissertation has a good chapter on this as well.

--
-Larry / Sector IX

From Michael.Frese at NumerEx.com Tue Dec 4 06:55:12 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <4754ABA1.9030105@scalableinformatics.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com>
Message-ID: <6.2.5.6.2.20071204042727.04f6d1f0@NumerEx.com>

Joe,

Thanks for the suggestions. Let me make some quick corrections.

At one point I knew about md5sum, but, as they say in Spanish, it forgot itself on me.

You are right about the data rate on the 32 bit PCI cards: I meant 300 Mbps. As for the time for wire speed transmission of 1 GB, at 300 Mbps it is only about 30 seconds.

It turns out the biggest file I am dealing with is 400 MB, not 1 GB, and the local md5sum takes only 10 seconds, indicating that the disk-to-memory speed is at least 40 MBps, which is about what I expect from this hardware, and about equal to the 300 Mbps ethernet speed on the single processor. But the remote md5sum takes almost 6 minutes to get the wrong answer.
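The figures above can be checked with a few lines of arithmetic. This is only a minimal sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 Mbps = 10^6 bits per second) and takes "almost 6 minutes" as 360 seconds, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the transfer figures quoted in this thread.
# Assumptions: decimal units, and "almost 6 minutes" treated as 360 seconds.

MB = 1e6  # bytes
GB = 1e9  # bytes


def transfer_seconds(size_bytes: float, rate_mbps: float) -> float:
    """Time to move size_bytes over a link running at rate_mbps megabits/s."""
    return size_bytes * 8 / (rate_mbps * 1e6)


def mbps_to_megabytes_per_s(rate_mbps: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return rate_mbps / 8


def effective_megabytes_per_s(size_bytes: float, seconds: float) -> float:
    """Observed throughput in megabytes per second."""
    return size_bytes / seconds / MB


if __name__ == "__main__":
    # 1 GB at 300 Mbps: roughly 27 s, i.e. "about 30 seconds"
    print(f"{transfer_seconds(1 * GB, 300):.1f} s")
    # 300 Mbps expressed in bytes: 37.5 MB/s
    print(f"{mbps_to_megabytes_per_s(300):.1f} MB/s")
    # local md5sum: 400 MB in 10 s is 40 MB/s
    print(f"{effective_megabytes_per_s(400 * MB, 10):.1f} MB/s")
    # remote md5sum: 400 MB in about 360 s is roughly 1.1 MB/s
    print(f"{effective_megabytes_per_s(400 * MB, 360):.1f} MB/s")
```

On these assumptions the remote read runs at roughly 1 MB/s, some 35 times slower than the local one, which is consistent with heavy retransmission rather than a link that is merely slow.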
The problem with disk system or memory hypotheses is that the local md5sum is consistent, and fast. There are no unexpected messages in /var/log/messages, and there is no /var/log/syslog. The only thing I haven't checked outside the box is the cable, so I will do that, but it seems unlikely. And yes, these boxes are old, but they have served me well, and my replacements won't be up and running till the end of the month. I also was hoping to find a better configuration choice, if there is one. Mike At 06:21 PM 12/3/2007, Joe Landman wrote: >Hi Michael: > >Michael H. Frese wrote: >>We were having trouble restarting from our homegrown parallel >>magnetohydrodynamic code's checkpoint files. The files could be >>read, but funny things happened in the run afterward. Eventually >>we figured out that the restarted parallel run differed from the >>serial restarted run from the same checkpoint. >>After much gnashing of teeth and rending of apparel, we found that >>the checkpoint files were being read incorrectly across NFS. That >>let us simplify our search for the problem. We first found that >>the local md5 digest [openssl dgst -md5 (file...)] on an NFS cp'ed >>version of the file > > md5sum filename > >does the same thing with a slightly simpler syntax. There is >mounting evidence that you should use sha1sum rather than md5sum. > >>was different from that produced on the original file. What was >>interesting was that the copy either took forEVER -- like 10 >>minutes or 20 minutes for a 1 GB file -- when the final result was >>bad or it took about a minute when the file was perfect. I'm >>guessing that whatever error checking that gets done on the packets >>was rejecting so many it finally got a bad packet it couldn't tell was bad. > >Sounds a great deal like a bad disk/disk system or something mucking >with your connection to the data. 1 GB file, even at 1 MB/s is 1000 >seconds, or 16 minutes. 
If you have a disk which keeps timing out, >or has bad blocks, and keeps retrying, well, stuff like this can >happen, especially on old kernels (and old hardware). > >Could also be a RAM error. > >>When we found that doing the md5 digest on a remote file produced a >>different result than doing it on the processor on which the disk >>was mounted, our tests got simpler. And shorter, still, after we >>found that we could get fairly frequent failures with 10 MB files or smaller. >>Clearly we had an NFS failure, probably associated with hardware. > >Yes. I would venture a guess that you are seeing *lots* of errors >in your /var/log/syslog or /var/log/messages files. > > >>This was all between two specific nodes of our small cluster. [Old >>hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual >>(Tyan...) chip motherboards both running Redhat 9 one with the >>2.4.20-8 kernels, though one is the smp version; NetGear GA311 NICs; and a > >Owie... > >>NetGear GS108 8 port Copper 1 GB/s switch. The single processor >>motherboards have 32-bit PCI slots so their network speeds are >>limited to 300 kbps as shown by netpipe. All of the LEDs at the >>ends of the cables show 1000Mb connections.] > >300 kbps? thats 300 kilo bits per second (abbreviations are *very* >important to get right, kB/s is not the same as kb/s). 300 kbps is >usually read as 300 kilo bits per second. Or about about 37.5 kB/s. >Which is about the average speed of various DSL lines. > >I hope you mean 30 MB/s (or 240 Mb/s). > >>Then we started checking other pairs. Some were fine. Some were >>bad in the same way. So we replaced the switch, changing to a 16 >>port NetGear GS216. That seemed to cure most of the problem. But >>we continued to > >We have seen bad switches a few times. > >>have problems copying a file on one particular single processor >>machine from the others. >>That's where we are now. 
The md5 digest run on that machine >>consistently shows the same result, whereas the digest for that >>file produced on a remote machine will be almost stochastic. In >>some cases it will eventually settle in to the right answer, and >>then the speed goes WAY up. I suppose that happens because the >>file request can be served from the local machine's cache. But why >>doesn't it happen after it received bad blocks? > >I am guessing you are using TCP NFS mounts as well? TCP forces >retries in the event of bad packets. UDP doesn't force this, but >the NFS protocol will try. Ram errors, bad cables, burnt switches, >and machines with interrupt problems (old machines often shared >interrupts without being able to do a very good job of it). > >>Most, if not all of the original network cards in those machines >>went bad and have been replaced in the last few years, so I decided >>to try a brand new GA311. No joy there. It still gives out the >>wrong info. I guess the motherboard PCI bus controller is hinky, >>but I'm far from sure. > >Did you try a new cable? Had a few cables go bad, usually they are >marginal to begin with. > >>We are in the process of upgrading and thus replacing all the >>machines we have of that configuration due to space limitations and >>their age, but I'm still curious what the problem could be. > >There are quite a few possibilities unfortunately. Unless you plan >to use these existing machines for quite a while longer, it might be >less painful to shut off the malfunctioning node. > >>Suggestions? Comments? > >2.4.20? Athlons? 
I would say a serious hardware/OS refresh is in order :) > > > >-- >Joseph Landman, Ph.D >Founder and CEO >Scalable Informatics LLC, >email: landman@scalableinformatics.com >web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com >phone: +1 734 786 8423 >fax : +1 866 888 3112 >cell : +1 734 612 4615 From dnlombar at ichips.intel.com Tue Dec 4 07:17:48 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: <4752C689.5030102@gmail.com> References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <474FEF18.6020308@obs.unige.ch> <4752C689.5030102@gmail.com> Message-ID: <20071204151748.GA26106@nlxdcldnl2.cl.intel.com> On Sun, Dec 02, 2007 at 03:51:53PM +0100, Toon Knapen wrote: > Mark Hahn wrote: > >>IMHO the hybris approach (MPI+threads) is interesting in case every > >>MPI-process has lots of local data. > > > >yes. but does this happen a lot? the appealing case would be threads > >that make lots of heavy use of some large data, _but_ > >without needing synchronization/locking. once you need locking > >among the threads, message passing starts to catch up. > > Direct solvers (for Finite Elements for instance) need a lot of data. > Additionally distributing the matrix generate interfaces (between the > different submatrices) which are hard to solve. In such situation, one > tries to minimize the number of interfaces (by having one submatrix per > MPI-process) and speed up the solving of each submatrix using threads. Yes, this is my direct experience with hybrid programming. An automated domain decomp is used to partition the model, and then threads (either native or OpenMP) are used within the domain. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. 
From dnlombar at ichips.intel.com Tue Dec 4 07:28:52 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Recommended paper for parallel sorting? In-Reply-To: References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> Message-ID: <20071204152852.GB26106@nlxdcldnl2.cl.intel.com> On Mon, Dec 03, 2007 at 01:38:12PM -0500, Robert G. Brown wrote: > On Mon, 3 Dec 2007, Peter St. John wrote: > > >(re Ian Foster, *Designing and Building Parallel Programs *online as below > >or Addison Wesley): > > > >I did that search and right the top was this link, which looks like > >homebase > >for the original material: > >http://www-unix.mcs.anl.gov/dbpp/ > >Very cool, thanks RGB for what looks like toothsome book. > > I went ahead and bought a paper copy, but it is nice to be able to > access the material from a workstation because I don't carry the copy > around with me all the time...;-) Whenever I find an example of both print and online copies of any reasonable text, I'll make sure I buy the print copy to reward such behavior. The Rute and SVN books are two additional examples. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From Michael.Frese at NumerEx.com Tue Dec 4 07:54:24 2007 From: Michael.Frese at NumerEx.com (Michael H. Frese) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] NFS Read Errors In-Reply-To: References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com> Message-ID: <6.2.5.6.2.20071204085359.04f72018@NumerEx.com> Mark, Thanks for your helpful comments. At 11:31 PM 12/3/2007, you wrote: >>I am guessing you are using TCP NFS mounts as well? TCP forces >>retries in the event of bad packets. UDP doesn't force this, but >>the NFS protocol will > >UDP has a checksum as well, though it's only 16b. then again, the TCP >checksum isn't all that strong for today's data rates either. 
From reading the man page on nfs on the systems with the 2.4 kernels, it looks like the default for an nfs mount is udp. It also looks like tcp is not really an option until nfs v4, so it may be something to try on the 2.6 kernels that I have on some of my newer machines at another site.

>you should definitely examine /proc/net/dev on involved machines.

I hadn't known about /proc/net/dev. When I check there, I see no transmit errors on the server side and no receive errors on the client side. That's odd, because the other thing I see is that the average packet size received (bytes received divided by packets received) on the client side is 3.9, while on the server side, the average packet size sent is 1430.

In other words, there are many more packets received than there ought to be. That's very fishy. It's probably the result of the way the packet count is done and reported. I.e., it may be that all the received packets -- good and bad -- are counted, but only the bytes in the good ones are counted, with some similar problem on the server side. I think the statistics are aggregate since the last boot, so they may not be just from the troublesome tests I was performing, either.

>I would attempt to reduce the complexity of your testing.
>for instance, can a node write and verify to its local disk
>without problem?

The local disk read seems rock solid in comparison to the NFS one. The local md5sum produces the same result time after time, which is just not the case for the remote.

>can it stream data over tcp sockets (netcat or the like) without
>corruption or obvious problems reflected
>in /proc/net/dev?

netcat is not on my systems. Looks like I have to get someone to download and build it for me, and try the streaming tests you recommend.

>does ethtool tell you anything about the config of the nic?

Not on the 2.4 systems, though it seems to tell me a little on the 2.6's.

>comparing tcp vs udp NFS would be sensible
>as well - varying the packet size, too.
>switching client and/or
>server to a modern 2.6 kernel may be instructive.

Upgrading the kernel is probably the only way I'll get nfs over tcp. Given that these systems are headed out the door, I'm not sure that's a good use of our time. But it may be worth doing on our new and newer systems.

Thanks again!

Mike

From jlb17 at duke.edu Tue Dec 4 09:24:54 2007
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com> <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
Message-ID:

On Tue, 4 Dec 2007 at 8:54am, Michael H. Frese wrote

> From reading the man page on nfs on the systems with the 2.4 kernels, it
> looks like the default for an nfs mount is udp. It also looks like tcp is
> not really an option until nfs v4, so it may be something to try on the 2.6
> kernels that I have on some of my newer machines at another site.

NFSv3 over TCP is the default for most modern distros (obviously this rules out your setup ;). I honestly don't remember if it was supported in RH9 (I think it was, but that was many moons ago) but it'd be easy to test. Just add 'tcp' to the mount options in /etc/fstab and try the mount. If it's not supported, it won't work.
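For anyone who would rather script that change than edit by hand, here is a minimal sketch of adding 'tcp' to the options of an NFS entry. It assumes the usual six-field fstab layout (device, mount point, type, options, dump, pass); the helper name and the example export are hypothetical, and it only rewrites a line of text, it does not touch /etc/fstab or attempt the mount.

```python
# Hypothetical helper: add the 'tcp' option to an NFS line in fstab-style text.
# Assumes whitespace-separated fields: device, mount point, type, options, ...

def add_tcp_option(fstab_line: str) -> str:
    """Return the line with 'tcp' in its options if it is an NFS entry."""
    stripped = fstab_line.strip()
    # Leave blank lines and comments untouched.
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    # Only touch entries whose filesystem type is nfs (or nfs4 etc.).
    if len(fields) < 4 or not fields[2].startswith("nfs"):
        return fstab_line
    options = fields[3].split(",")
    # Nothing to do if TCP is already requested.
    if "tcp" in options or "proto=tcp" in options:
        return fstab_line
    # Drop an explicit UDP request so the two options do not conflict.
    options = [opt for opt in options if opt not in ("udp", "proto=udp")]
    options.append("tcp")
    fields[3] = ",".join(options)
    return " ".join(fields)


if __name__ == "__main__":
    # Made-up export and mount point, for illustration only.
    example = "server:/export/home /home nfs rw,hard,intr 0 0"
    print(add_tcp_option(example))
```

After changing the real entry, remounting the filesystem is the actual test: as noted above, an unsupported option simply makes the mount fail.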
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From mousavi.ehsan at gmail.com Mon Dec 3 21:47:36 2007 From: mousavi.ehsan at gmail.com (Ehsan Mousavi) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] CSharifi Next generation of HPC In-Reply-To: <4752C689.5030102@gmail.com> References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <474FEF18.6020308@obs.unige.ch> <4752C689.5030102@gmail.com> Message-ID: C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level Paradigm" for Distributed Computing Support Contrary to two school of thoughts in providing system software support for distributed computation that advocate either the development of a whole new distributed operating system (like Mach), or the development of library-based or patch-based middleware on top of existing operating systems (like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another school of thought as his thesis in 1986 that believes all distributed systems software requirements and supports can be and must be built at the Kernel Level of existing operating systems; requirements like Ease of Programming, Simplicity, Efficiency, Accessibility, etc which may be coined as Usability. Although the latter belief was hard to realize, a sample byproduct called DIPC was built purely based on this thesis and openly announced to the Linux community worldwide in 1993. This was admired for being able to provide necessary supports for distributed communication at the Kernel Level of Linux for the first time in the world, and for providing Ease of Programming as a consequence of being realized at the Kernel Level. However, it was criticized at the same time as being inefficient. This did not force the school to trade Ease of Programming for Efficiency but instead tried hard to achieve efficiency, alongside ease of programming and simplicity, without defecting the school that advocates the provision of all needs at the kernel level. 
The result of this effort is now manifested in the C-Sharifi Cluster Engine. C-Sharifi is a cost effective distributed system software engine in support of high performance computing by clusters of off-the-shelf computers. It is wholly implemented in Kernel, and as a consequence of following this school, it has Ease of Programming, Ease of Clustering, Simplicity, and it can be configured to fit as best as possible to the efficiency requirements of applications that need high performance. It supports both distributed shared memory and message passing styles, it is built in Linux, and its cost/performance ratio in some scientific applications (like meteorology and cryptanalysis) has shown to be far better than non-kernel-based solutions and engines (like MPI, Kerrighed and Mosix). Best Regard ~Ehsan Mousavi C-Sharifi Development Team -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Toon Knapen Sent: Sunday, December 02, 2007 6:22 PM To: Mark Hahn Cc: Beowulf Mailing List Subject: Re: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI Mark Hahn wrote: >> IMHO the hybris approach (MPI+threads) is interesting in case every >> MPI-process has lots of local data. > > yes. but does this happen a lot? the appealing case would be threads > that make lots of heavy use of some large data, _but_ > without needing synchronization/locking. once you need locking > among the threads, message passing starts to catch up. Direct solvers (for Finite Elements for instance) need a lot of data. Additionally distributing the matrix generate interfaces (between the different submatrices) which are hard to solve. In such situation, one tries to minimize the number of interfaces (by having one submatrix per MPI-process) and speed up the solving of each submatrix using threads. Finance is another example. 
Financial applications need to evaluate a large number of open positions based on the simulated, current or past market-data. There are many dependencies between all the different data which makes that it is hard to decompose the data in largely independent chunks. > >> latter is simpler because it only requires MPI-parallelism but if the >> code >> is memory-bound and every mpi-process has much of the same data, it >> will be >> better to share this common data with all processes on the same cpu >> and thus >> use threads intra-node. > > what kind of applications behave like that? I agree that if your MPI > app is keeping huge amounts of (static) data replicated in each rank, > you should rethink your design. > See above. _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gmichal at uow.edu.au Tue Dec 4 00:15:18 2007 From: gmichal at uow.edu.au (Guillaume Michal) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] A cluster for material simulation In-Reply-To: References: Message-ID: Space in not an issue at all in fact but as a mech engineer I'm more: "the less parts the better", so I tend to try to factorise and make it as simple as possible. The heat won't be a problem as air conditioning exists in the room. In term of tasks sizes, 10G is what we need "tomorrow", as our understanding of the cluster increase, we will increase the size of the problems. By the way, thank you for your indications, I'm going to to think a bit more about all that, and... try to understand what IPMI are all about ;-) Guillaume From nelsoneci at gmail.com Tue Dec 4 05:52:43 2007 From: nelsoneci at gmail.com (Nelson Castillo) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Recommended paper for parallel sorting? 
In-Reply-To: <47555A3B.3080609@sicortex.com> References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> <47555A3B.3080609@sicortex.com> Message-ID: <2accc2ff0712040552n5e5f222fo2c241aa11fb577b6@mail.gmail.com> On Dec 4, 2007 8:46 AM, Larry Stewart wrote: (cut) > I was looking into this a few months ago. Here are some good papers I > found: > > http://citeseer.ist.psu.edu/393851.html -- Communications Conscious > Radix Sort > > http://citeseer.ist.psu.edu/569483.html -- Parallel Algorithms for > Personalized Communication and Sorting With an Experinmental Study > > Martin Schmollinger: Improving Communication Sensitive Parallel Radix > Sort for Unbalanced Data. Euro-Par 2003 > : > 885-893 > > Schmollinger's PhD dissertation has a good chapter on this as well. > > -- > -Larry / Sector IX Thanks a lot for all your responses. I am very curious about Parallel Radix Sort. I've read and watched the 5th lecture of this course, and I wanted to know more about parallel implementations. I've found many papers in the subject, but in this case I preferred to ask for the relevant ones since it is easy to get lost with papers that are not that good. http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-046JFall-2005/LectureNotes/index.htm Regards. -- http://arhuaco.org From examachine at gmail.com Tue Dec 4 06:18:35 2007 From: examachine at gmail.com (Eray Ozkural) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries with MPI In-Reply-To: References: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net> Message-ID: <320e992a0712040618t5e8a2f9ch7bb26867ffeba8ba@mail.gmail.com> On Dec 4, 2007 2:53 PM, Robert G. Brown wrote: > On Mon, 3 Dec 2007, richard.walsh@comcast.net wrote: > > > impossibility. I am still waiting to get a straight flush in 5-card > > draw. > > Are ye, now... interesting. > > Sometime we'll have to wait together. 
In the meantime, I find that if > you play the game with a wild card or eight it alters the odds > magnificently. Why, you can get a straight flush and still lose the > game...;-) I think for many types of code pure MPI code would be much easier to develop, granted, but an auto parallel compiler can choose to use either type (for certain types of codes where the compiler would work at all). Am I speaking the obvious? Where it suits, the multithreaded code can be much faster than MPI code as it can avoid copying large messages. Depends very much on what type of communication there is in the algorithm. Maybe Greg is right for the majority of X kind of code, I wouldn't have a problem with that statement, but in general I'm quite doubtful that there can be no performance gains. Best, -- Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara http://www.cs.bilkent.edu.tr/~erayo Malfunct: http://myspace.com/malfunct ai-philosophy: http://groups.yahoo.com/group/ai-philosophy From rokrau at yahoo.com Tue Dec 4 09:02:27 2007 From: rokrau at yahoo.com (Roland Krause) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Recommended paper for parallel sorting? In-Reply-To: Message-ID: <861246.65508.qm@web81113.mail.mud.yahoo.com> Speaking of Ian Foster's books. Does anyone have an opinion about this one? The Sourcebook of Parallel Computing (The Morgan Kaufmann Series in Computer Architecture and Design) This book seems to be quite a bit newer, has a different focus obviously, but I'd like to know what you think about it? Thanks, Roland --- "Robert G. Brown" wrote: > You might check out Ian Foster's free online book on parallel > algorithms. It is worth buying if you're going to be doing a lot of > parallel programming. Or there are two or three other decent > textbooks > on parallel programming at the algorithm level. I don't recall > offhand > if Foster covers sorting, but you can easily found out for free. 
>

From mg.mailing-list at laposte.net Tue Dec 4 11:03:34 2007
From: mg.mailing-list at laposte.net (Mathieu Gontier)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] use a MPI library thought a shared library
Message-ID: <4755A486.5000109@laposte.net>

Hi all,

I am currently working on a project named MorphMPI. Its main purpose is to offer a generic interface to developers of parallel applications, and to let them choose the MPI library/interconnect at run time by rebuilding a shared morph library against the desired MPI library. (The final application is linked against a shared morph library instead of the real MPI library.)
For more information, you can follow these links:
- http://www.clustermonkey.net//content/view/213/32/
- http://sourceforge.net/projects/morphmpi

I have run into a small problem whichever MPI library I use (I tried MPICH-1.2.5.2, MPICHGM and IntelMPI).
When MorphMPI is linked statically with my parallel application, everything is OK; but when MorphMPI is linked dynamically with my parallel application, MPI_Get_count returns a wrong value.

I concluded it is difficult to use an MPI library through a shared library. I wonder if someone has more information about it (in that case, you're welcome ;-) )

Thank you for your support,
Mathieu.

PS: my problem happens in the following example,

    #include <stdio.h>       /* printf */
    #include "morph_mpi.h"   /* MorphMPI wrappers */
    /* a third #include was present; its header name did not survive */

    int main( int argc, char* argv[] )
    {
        int np, me, ier, flag=0, msglen=-1 ;
        MorphMPI_Request request ;
        MorphMPI_Status status ;
        int buf[1] ; buf[0]=-1 ;

        ier = MorphMPI_Init( &argc, &argv ) ;
        ier = MorphMPI_Comm_size( MorphMPI_COMM_WORLD, &np ) ;
        ier = MorphMPI_Comm_rank( MorphMPI_COMM_WORLD, &me ) ;

        if( me > 1 ) printf( "I am the useless processor #%d on %d\n", me, np ) ;
        else printf( "I am the working processor #%d on %d\n", me, np ) ;

        ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;

        /* print the address of the status object (%p, not %d, for a pointer) */
        printf( "<<< %p >>>\n", (void*) &status ) ;

        /* rank 0 sends one int to rank 1 */
        if( ! me ) {
            buf[0] = 69 ;
            ier = MorphMPI_Isend( buf, 1, MorphMPI_INT, 1, 1, MorphMPI_COMM_WORLD, &request ) ;
            ier = MorphMPI_Wait( &request, &status ) ;
        }

        ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;

        /* rank 1 receives it and checks the element count reported in status */
        if( me == 1 ) {
            ier = MorphMPI_Irecv( buf, 1, MorphMPI_INT, 0, 1, MorphMPI_COMM_WORLD, &request ) ;
            ier = MorphMPI_Wait( &request, &status ) ;
            ier = MorphMPI_Get_count( &status, MorphMPI_INT, &msglen ) ;

            if( msglen != 1 ) printf( "ERROR: The length of the message is not 1\n" ) ;
            else printf( "SUCCESS !\n" ) ;
        }

        ier = MorphMPI_Finalize() ;
        return 0 ;
    }

--
Mathieu Gontier
Core Development Engineer

Read the attached v-card for telephone, fax, address
Look at our web-site http://www.fft.be

From larry.stewart at sicortex.com Tue Dec 4 11:49:11 2007
From: larry.stewart at sicortex.com (Larry Stewart)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] Intel MPI Benchmark maintainers?
In-Reply-To:
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org> <474FEF18.6020308@obs.unige.ch> <4752C689.5030102@gmail.com>
Message-ID: <4755AF37.2000807@sicortex.com>

Does anyone know where to send bug fixes for the Intel MPI Benchmarks?

Simple stuff - bad printfs in error handling paths, but I can't find an email address for such things.

-L

From peter.st.john at gmail.com Tue Dec 4 12:05:23 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] use a MPI library thought a shared library
In-Reply-To: <4755A486.5000109@laposte.net>
References: <4755A486.5000109@laposte.net>
Message-ID:

Mathieu,
I didn't spot why you included ? It seems you work thru morph_mpi.h wrappers, right? Perhaps I misunderstand?
Peter

On Dec 4, 2007 2:03 PM, Mathieu Gontier wrote:

> Hi all,
>
> I am currently working with a project named MorphMPI. Its main purpose
> is to offer a generic interface for the developers of parallel
> applications, and chose the MPI library/interconnect at the runtime by
> rebuilding a shared morph library against the desire MPI library.
(The > final application is linked against a shared morph library instead of > the real MPI library.) > For more information about that, you can follow these links: > - http://www.clustermonkey.net//content/view/213/32/ > - http://sourceforge.net/projects/morphmpi > > So, I meet a little problem whatever the MPI library used (I tried with > MPICH-1.2.5.2, MPICHGM and IntelMPI). > When MorphMPI is linked statically with my parallel application, > everything is ok; but when MorphMPI is linked dynamically with my > parallel application, MPI_Get_count return a wrong value. > > I concluded it is difficult to use a MPI library thought a shared > library. I wonder if someone have more information about it (in this > case, you're welcome ;-) ) > > Thank you for your support, > Mathieu. > > PS: my problem happens in the the following example, > > # include > > # include > > #include > > > int main( int argc, char* argv[] ) > > { > > int np, me, ier, flag=0, msglen=-1 ; > > MorphMPI_Request request ; > > MorphMPI_Status status ; > > int buf[1] ; buf[0]=-1 ; > > > ier = MorphMPI_Init( &argc, &argv ) ; > > ier = MorphMPI_Comm_size( MorphMPI_COMM_WORLD, &np ) ; > > ier = MorphMPI_Comm_rank( MorphMPI_COMM_WORLD, &me ) ; > > > if( me > 1 ) printf( "I am the useless processor #%d on %d\n", me, np ) ; > > else printf( "I am the working processor #%d on %d\n", me, np ) ; > > > ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ; > > > printf( "<<< %d >>>\n", &status ) ; > > > if( ! 
me ) { > > buf[0] = 69 ; > > ier = MorphMPI_Isend( buf, 1, MorphMPI_INT, 1,1, MorphMPI_COMM_WORLD, > &request ) ; > > ier = MorphMPI_Wait( &request, &status ) ; > > } > > > ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ; > > > if( me == 1 ) { > > ier = MorphMPI_Irecv( buf, 1, MorphMPI_INT, 0, 1, MorphMPI_COMM_WORLD, > &request ) ; > > ier = MorphMPI_Wait( &request, &status ) ; > > ier = MorphMPI_Get_count( &status, MorphMPI_INT, &msglen ) ; > > > if( msglen != 1 ) printf( "ERROR: The lengh of the message is not 1\n" > ) ; > > else printf( "SUCCESS !\n" ) ; > > } > > > ier = MorphMPI_Finalize() ; > > } > > > > -- > Mathieu Gontier > Core Development Engineer > > Read the attached v-card for telephone, fax, adress > Look at our web-site http://www.fft.be > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071204/7be2dfbb/attachment.html From henry.gabb at intel.com Tue Dec 4 12:13:15 2007 From: henry.gabb at intel.com (Gabb, Henry) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] RE: Intel MPI Benchmark maintainers? In-Reply-To: <200712042000.lB4K03ms028979@bluewest.scyld.com> Message-ID: <4D97B70CF7F72144881F66DFF4BD7A12031AFE82@fmsmsx413.amr.corp.intel.com> Hi Larry, The Intel MPI Benchmarks are part of the Intel Cluster Toolkit (http://www3.intel.com/cd/software/products/asmo-na/eng/307696.htm) so you can submit bug reports to your Premier Support account for ICT. If you don't have a Premier account, you can send the bug reports directly to me. I'll make sure they get to the right place. 
Henry Gabb Intel Cluster Software and Technologies From mathog at caltech.edu Tue Dec 4 12:45:25 2007 From: mathog at caltech.edu (David Mathog) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Re: NFS Read Errors Message-ID: I missed the beginning of this thread - what were the parameters in /etc/fstab on the client? Unless hard mounts are used it is possible for a block of null bytes to end up in the file where data was supposed to be. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rosing at peakfive.com Tue Dec 4 13:08:08 2007 From: rosing at peakfive.com (Matt Rosing) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Re: use a MPI library thought a shared library In-Reply-To: <200712042000.lB4K03ms028979@bluewest.scyld.com> References: <200712042000.lB4K03ms028979@bluewest.scyld.com> Message-ID: <18261.49592.175837.718416@lala.site> > From: Mathieu Gontier > > So, I meet a little problem whatever the MPI library used (I tried with > MPICH-1.2.5.2, MPICHGM and IntelMPI). > When MorphMPI is linked statically with my parallel application, > everything is ok; but when MorphMPI is linked dynamically with my > parallel application, MPI_Get_count return a wrong value. I'm guessing your machine is suffering from version hell and your LD_LIBRARY_PATH environment variable doesn't match your Makefile. We use modules and someone else figures all that out. Hope this helps, Matt From landman at scalableinformatics.com Tue Dec 4 13:22:32 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Re: NFS Read Errors In-Reply-To: References: Message-ID: <4755C518.5070409@scalableinformatics.com> David Mathog wrote: > I missed the beginning of this thread - what were the parameters > in /etc/fstab on the client? > > Unless hard mounts are used it is possible for a block of > null bytes to end up in the file where data was supposed to be. 
I think his issue is one of an over-zealous retry loop somewhere ... He is using udp mounts by default (could do a "mount -o remount,tcp /path" to change to tcp, but I don't think this will help). It sounded to me like a bad HD, but his local HD reads/writes seem ok (is this correct)? It could be a) bad driver b) bad NIC c) bad PCI slot d) bad cable e) bad switch f) bad switch port g) other things :) The gear he was using is *old*, and the distro is a 2.4.20 based thing (RH9 I think?). If it is worth the time and effort to hunt it down, I might suggest investing in a pair of new (different NICs) putting them in a node with a crossover cable, and making sure he can pass data back and forth without issue. Then see if the problem emerges in changing one thing at a time (or bisect the search space, but the list is short enough that either one would work well). > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Tue Dec 4 13:49:05 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] use a MPI library thought a shared library In-Reply-To: <4755A486.5000109@laposte.net> References: <4755A486.5000109@laposte.net> Message-ID: <4755CB51.5050802@scalableinformatics.com> Greetings Mathieu: Mathieu Gontier wrote: [...] > So, I meet a little problem whatever the MPI library used (I tried with > MPICH-1.2.5.2, MPICHGM and IntelMPI). 
> When MorphMPI is linked statically with my parallel application,
> everything is ok; but when MorphMPI is linked dynamically with my
> parallel application, MPI_Get_count return a wrong value.
>
> I concluded it is difficult to use a MPI library thought a shared
> library. I wonder if someone have more information about it (in this

Not likely. I would suggest ldd. It is your friend. For example:

joe@pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe
    libm.so.6 => /lib/libm.so.6 (0x00002b5409d17000)
    libmpi.so.0 => not found
    libopen-rte.so.0 => not found
    libopen-pal.so.0 => not found
    librt.so.1 => /lib/librt.so.1 (0x00002b5409f99000)
    libdl.so.2 => /lib/libdl.so.2 (0x00002b540a1a2000)
    libnsl.so.1 => /lib/libnsl.so.1 (0x00002b540a3a6000)
    libutil.so.1 => /lib/libutil.so.1 (0x00002b540a5c0000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x00002b540a7c3000)
    libc.so.6 => /lib/libc.so.6 (0x00002b540a9de000)
    /lib64/ld-linux-x86-64.so.2 (0x00002b5409af9000)

Notice that libmpi.so.0 is not found, so I can't run this by hand. Unless I force the issue using LD_LIBRARY_PATH

joe@pegasus-i:~/workspace/source-mpi$ export LD_LIBRARY_PATH="/home/joe/local/lib64/:/home/joe/local/lib/"
joe@pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe
    libm.so.6 => /lib/libm.so.6 (0x00002ae35ca50000)
    libmpi.so.0 => /home/joe/local/lib/libmpi.so.0 (0x00002ae35ccd1000)
    libopen-rte.so.0 => /home/joe/local/lib/libopen-rte.so.0 (0x00002ae35cfe8000)
    libopen-pal.so.0 => /home/joe/local/lib/libopen-pal.so.0 (0x00002ae35d2b3000)
    librt.so.1 => /lib/librt.so.1 (0x00002ae35d514000)
    libdl.so.2 => /lib/libdl.so.2 (0x00002ae35d71d000)
    libnsl.so.1 => /lib/libnsl.so.1 (0x00002ae35d921000)
    libutil.so.1 => /lib/libutil.so.1 (0x00002ae35db3b000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x00002ae35dd3e000)
    libc.so.6 => /lib/libc.so.6 (0x00002ae35df59000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ae35c832000)

and it might even run ...

joe@pegasus-i:~/workspace/source-mpi$ ./matmul_mpi_3.exe
D[tid=0]: running on machine = pegasus-i
D: checking arguments: N_args=1
D: arg[0] = ./matmul_mpi_3.exe
Allocating memory ...
array size in MB = 7.629 MB (remember, you have 2 of these)normalization a: 0.05510, b: 0.00173
0 : loop_min = 0, loop_max = 1000
...

Do you have some sort of LD_LIBRARY_PATH set up? Or something set in /etc/ld.so.conf that points to where these things are? Remember, mpirun/mpiexec's alternative purpose in life is to set up the correct run time environment for you, so you might want to see what is going on with the environment in your equivalent command.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web : http://www.scalableinformatics.com
      http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615

From Michael.Frese at NumerEx.com Tue Dec 4 13:54:51 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] Re: NFS Read Errors
In-Reply-To:
References:
Message-ID: <6.1.2.0.2.20071204144808.06568008@themis.numerex.com>

David,

The fstab mount parameters are 'rw,hard,bg', so I think that's not the problem.

I'll send you my original missive separately.

Thanks.

Mike

At 03:01 PM 12/4/2007, David Mathog wrote:
>I missed the beginning of this thread - what were the parameters
>in /etc/fstab on the client?
>
>Unless hard mounts are used it is possible for a block of
>null bytes to end up in the file where data was supposed to be.
>
>Regards,
>
>David Mathog
>mathog@caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071204/6c80adcb/attachment.html From becker at scyld.com Tue Dec 4 14:00:52 2007 From: becker at scyld.com (Donald Becker) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] CSharifi Next generation of HPC In-Reply-To: Message-ID: [[[ Hmmmm, OK, I seem to have moderation-approved pretty much a repeat of a wide-spread posting. So I'll answer with the response I was planning a few days ago. ]] On Tue, 4 Dec 2007, Ehsan Mousavi wrote: > C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level > Paradigm" for Distributed Computing Support > > Contrary to two school of thoughts in providing system software support for > (like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another > school of thought as his thesis in 1986 that believes all distributed > systems software requirements and supports can be and must be built at the > Kernel Level of existing operating systems; In 1986 I had been working for a few years on shared memory systems with a hefty proportion of custom-designed hardware. I learned from that experience. That's why I now work on distributed memory systems based on off-the-shelf commodity hardware. I also think that there are some important aspects of cluster infrastructure that (at present) can only be implemented by tweaking the kernel. But most of the features to make a cluster easy to use don't need special kernel support, and indeed can't be implemented inside the kernel at all. You might initially think "you can put any program inside the kernel, therefore you can do everything inside the kernel". But as a counter-example consider name services. Essentially all programs use the standard library interface to name services, which in turn uses the Name Service Switch. You can add a bunch of really powerful feature by using a cluster-specific name service. And this can only be done by working with the existing user-level library code. 
(Well, unless you build a new library within your kernel.)

This argument almost misses the main point: Cluster systems exist to simplify the system for the end users. When you think in terms of kernel modifications, most of the changes end up being tricks to prove to other developers how clever you are, not features that make the system easier to use (example: Plan 9). And most of the clever tricks end up getting in the way of the developer, rather than speeding up the application or really simplifying the programming model.

DSM / Distributed Shared Memory (which I prefer to call NVM, Network Virtual Memory) is a perfect example of this. It certainly doesn't help the end user. The only aspect an end user or system administrator sees is that NVM causes cascading system failures when one machine drops out of the cluster. The programmer doesn't benefit either. They initially think that NVM gives them an easy to use shared memory model. They quickly find that it only appears to be normal memory. To get even barely acceptable performance they have to treat the shared memory very differently than regular memory. Variables written by different processes have to be segregated into different pages. Writes have to be grouped. You have to think about when to manually cache structures to avoid a re-read that might trigger a network page fault, but refresh that structure when you need potentially updated values.

Many independent attempts have concluded that most application ports take a long time to tune for NVM, and almost all end up using NVM as a stylized message passing mechanism.
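A minimal sketch of the page-segregation point, in plain C and assuming nothing about any particular NVM/DSM package: `alloc_private_pages` and `page_of` are hypothetical helper names, and the memory here is ordinary process memory, so this only illustrates the layout rule (each writer's data padded out to its own page), not the network behaviour.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign and sysconf */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from any DSM/NVM package): return a zeroed block
 * that starts on a page boundary and is padded to a whole number of pages,
 * so no other writer's variables can end up on the same page. */
void *alloc_private_pages(size_t nbytes)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page, padded;
    void *p = NULL;

    assert(ps > 0);
    page = (size_t)ps;
    padded = ((nbytes + page - 1) / page) * page;   /* round up to pages */
    if (padded == 0)
        padded = page;                               /* at least one page */
    if (posix_memalign(&p, page, padded) != 0)
        return NULL;
    memset(p, 0, padded);
    return p;
}

/* Page number of an address; objects with different page numbers can never
 * trigger a fault on each other's page. */
uintptr_t page_of(const void *p)
{
    return (uintptr_t)p / (uintptr_t)sysconf(_SC_PAGESIZE);
}
```

Each writer would then take its own block, e.g. `double *mine = alloc_private_pages(sizeof *mine);`, at the cost of a full page per writer, which is part of why such tuning is tedious.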
-- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From hahn at mcmaster.ca Wed Dec 5 08:22:55 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] CSharifi Next generation of HPC In-Reply-To: References: Message-ID: > DSM / Distributed Shared Memory (which I prefer to call NVM, Network > Virtual Memory) is a prefect example of this. It certainly doesn't help I think the 'N' is a valuable change, but would suggest NSM is even better. to me, the V hints too much of paging-type VM, and doesn't hint at the main point (sharing). > the end user. The only aspect an end user or system administrator sees is > that NVM causes cascading system failures when one machine drops out of > the cluster. a really good NSM implementation might well provide some kind of persistence, even replication of the space. it would be tricky to do without introducing some sort of transactional support, though, and that seriously complicates the user-level interface. of course, people who do this sort of thing often worry about different consistency models which require transaction-like directives anyway. again the programmer's interface becomes not so simple. > The programmer doesn't benefit either. They initially > think that NVM gives them an easy to use shared memory model. They > quickly find that it only appears to be normal memory. To get even barely > acceptable performance they have to treat the shared memory very > differently than regular memory. Variables written by different processes > have to be segregated into different pages. Writes have to grouped. You > have to think about when to manually cache structures to avoid a re-read > that might trigger a network page fault, but refresh that structure when > you need potentially updated values. well put. 
I was pondering how to say this while also pointing out that even within a single machine, programmers really cannot think memory is flat. that is, you have to program for your caches.

level      latency     size     concurrency
register   <.5 ns      8B       1-10? (renaming)
L1         1-2 ns      64B      ~2
L2/3       4-20 ns     64B      ~1
ram        50-80 ns    64B      1-4
remote     5+ us       4KB      1
swap       10 ms       >=4KB    1

the 'remote' there is for a reference to an NSM page that has to be brought over the net, and is assuming a fast interconnect. it's effectively the same as an MPI send and receive. notice that you can't really express just a send with NSM (it would be a blind write).

I think NSM is attractive mainly at a shallow level: either for very simple, limited applications which just want to replicate a chunk of read-only shared memory across machines, or cases where details like locking and locality haven't been thought out yet.

From Michael.Frese at NumerEx.com Wed Dec 5 08:55:50 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com> <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
Message-ID: <6.2.5.6.2.20071205085308.04eff510@NumerEx.com>

This tale is at an end, I think, because I can't bear to tell it much longer. As many have suggested, there is probably a hardware problem, and since the hardware is old, I will do without the services of the troublesome machines -- It turns out that there is another acting up as well -- till they are replaced in a couple of weeks. Many thanks to all who racked their brains for helpful suggestions.

I want to tell a little more of what I have learned, before I drop the subject altogether.

First, I did swap the cable of the bad machine with that of a good one with no effect on either machine. This eliminates the possibility of the cable or the switch port being bad.
Since I had previously changed out the NIC and the switch, the only remaining possibility is something inside the machine itself, probably the motherboard, but possibly a corrupted kernel module for handling udp -- more on that below.

Second, we could find no sign of this failure in any log. Nor did /proc/net/dev show any errors. The suggestion is that older kernels aren't going to detect and report such errors. I think that's because they do nfs over udp. More about that in a moment.

Third, though netcat isn't on these systems, nc is. We didn't get around to trying it, because we found ttcp.

Fourth, with ttcp over tcp, I found that the troubled machine could send 800 MB in about 20 seconds -- the wire speed for those 32-bit PCI slots as tested by netpipe. However, if I used ttcp over udp, I couldn't reliably send even ten 8192-byte blocks! Successive sends and receives would receive 3, or 1, or 5 blocks. Don't ask me how these two facts are compatible. I don't know.

Clearly, this puts a premium on using tcp for nfs. All our attempts to do that failed. Well, both of them, anyway. In the first one, we unmounted the offending disk, modified its fstab entry, and remounted it. We were pretty careful in the second one, where we added tcp to the fstab argument, unmounted all the remote disks, restarted all the nfsd's, and did 'mount -a'. We got an error message in both cases that didn't obviously refer to the tcp argument, but the mount didn't happen. As I write this, I see references to tcp mount requests in the mountd man page, so maybe we need to do a bit more here.

The Wikipedia article on nfs says this: "At the time of introduction of Version 3, vendor support for TCP as a transport-layer protocol began increasing. While several vendors had already added support for NFS Version 2 with TCP as a transport, Sun Microsystems added support for TCP as a transport for NFS at the same time it added support for Version 3."
I'd like to know what version of nfs this server supports, but the man page on nfsd doesn't say. The man page on rpc.mountd says that it supports nfs version 2 and version 3, but that "If the NFS kernel module was compiled without support for NFSv3, rpc.mountd must be invoked with the option --no-nfs-version 3." Yet the /proc/procnum/cmdline for the running rpc.mountd doesn't show a --no-nfs-version argument. Clearly, both the kernel and the server need to support the use of tcp. I'd like to get any of our other machines with these older kernels at other sites to using tcp for nfs where possible, in order to avoid this in the future. We are already seeing signs of network problems on them. If that's not possible, then in order to avoid a complete rebuild of those systems -- there are 12 of them -- we are going to put a testing script together using remote invocations of md5sum and comparison of results to recorded local results. Thanks again! Mike At 08:54 AM 12/4/2007, you wrote: >Mark, > >Thanks for your helpful comments. > >At 11:31 PM 12/3/2007, you wrote: >>>I am guessing you are using TCP NFS mounts as well? TCP forces >>>retries in the event of bad packets. UDP doesn't force this, but >>>the NFS protocol will >> >>UDP has a checksum as well, though it's only 16b. then again, the TCP >>checksum isn't all that strong for today's data rates either. > > From reading the man page on nfs on the systems with the 2.4 > kernels, it looks like the default for an nfs mount is udp. It > also looks like tcp is not really an option until nfs v4, so it may > be something to try on the 2.6 kernels that I have on some of my > newer machines at another site. > >>you should definitely examine /proc/net/dev on involved machines. > >I hadn't known about /proc/net/dev. When I check there, I see no >transmit errors on the server side and no receive errors on the >client side. 
That's odd, because the other thing I see is that the >average packet size received (bytes received divided by packets >received) on the client side is 3.9, while on the server side, the >average packet size sent is 1430. In other words, there are a many >more packets received than there ought to be. That's very >fishy. It's probably the result of the way the packet count is done >and reported. I.e., it may be that all the received packets -- good >and bad -- are counted, but only the bytes in the good ones are >counted, with some similar problem on the server side. I think the >statistics are aggregate since the last boot, so they may not be >just from the troublesome tests I was performing, either. > >>I would attempt to reduce the complexity of your testing. >>for instance, can a node write and verify to its local disk >>without problem? > >The local disk read seems rock solid in comparison to the NFS >one. The local md5sum produces the same result time after time, >which is just not the case for the remote. > >>can it stream data over tcp sockets (netcat or the like) without >>corruption or obvious problems reflected >>in /proc/net/dev? > >netcat is not on my systems. Looks like I have to get someone to >download and build it for me, and try the streaming tests you recommend. > >>does ethtool tell you anything about the config of the nic? > >Not on the 2.4 systems, though it seems to tell me a little on the 2.6's. > >>comparing tcp vs udp NFS would be sensible >>as well - varying the packet size, too. switching client and/or >>server to a modern 2.6 kernel may be instructive. > >Upgrading the kernel is probably the only way I'll get nfs over >tcp. Given that these systems are headed out the door, I'm not sure >that's a good use of our time. But it may be worth doing an our new >and newer systems. > >Thanks again! 
> > >Mike > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From jlb17 at duke.edu Wed Dec 5 09:26:20 2007 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] NFS Read Errors In-Reply-To: <6.2.5.6.2.20071205085308.04eff510@NumerEx.com> References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com> <6.2.5.6.2.20071204085359.04f72018@NumerEx.com> <6.2.5.6.2.20071205085308.04eff510@NumerEx.com> Message-ID: On Wed, 5 Dec 2007 at 9:55am, Michael H. Frese wrote > Clearly, this puts a premium on using tcp for nfs. All our attempts to do > that failed. Well, both of them, anyway. In the first one, we unmounted the > offending disk, modified its fstab entry, and remounted it. We were pretty > careful in the second one, where we added tcp to the fstab argument, > unmounted all the remote disks, restarted all the nfsd's, and did 'mount -a'. > We got an error message in both cases that didn't obviously refer to the tcp > argument, but the mount didn't happen. As I write this, I see references to > tcp mount requests in the mountd man page, so maybe we need to do a bit more > here. > > The Wikipedia article on nfs says this: "At the time of introduction of > Version 3, vendor support for TCP as a transport-layer protocol began > increasing. While several vendors had already added support for NFS Version 2 > with TCP as a transport, Sun Microsystems added support for TCP as a > transport for NFS at the same time it added support for Version 3." > > I'd like to know what version of nfs this server supports, but the man page > on nfsd doesn't say. 
The man page on rpc.mountd says that it supports nfs
> version 2 and version 3, but that "If the NFS kernel module was compiled
> without support for NFSv3, rpc.mountd must be invoked with the option
> --no-nfs-version 3." Yet the /proc/procnum/cmdline for the running
> rpc.mountd doesn't show a --no-nfs-version argument. Clearly, both the
> kernel and the server need to support the use of tcp.

Looking back through this thread, I don't see any details on the NFS server, only the clients. What are the hardware and OS version of the NFS server?

Grepping through the kernel config for RH9 shows it definitely did not support NFS over TCP *as a server*. If your server is newer, though, and does support a TCP nfsd, then you may have to look at other stuff (firewall rules, TCP wrappers, etc) as to why the TCP mounts didn't work.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF

From Michael.Frese at NumerEx.com Wed Dec 5 09:57:23 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] NFS Read Errors
In-Reply-To:
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com> <4754ABA1.9030105@scalableinformatics.com> <6.2.5.6.2.20071204085359.04f72018@NumerEx.com> <6.2.5.6.2.20071205085308.04eff510@NumerEx.com>
Message-ID: <6.2.5.6.2.20071205105246.04f3e988@NumerEx.com>

Joshua,

Thanks for the info on the nfs server in RH9. We are using that distro unmodified out of the box, so to speak, so that clearly blocks any possibility of fixing the problem in software.

As for the hardware, it was described earlier as follows:

[Old hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual (Tyan...) chip motherboards both running Redhat 9 one with the 2.4.20-8 kernels, though one is the smp version; NetGear GA311 NICs; and a NetGear GS108 8 port Copper 1 Gb/s switch. The single processor motherboards have 32-bit PCI slots so their network speeds are limited to 300 Mbps as shown by netpipe.
All of the LEDs at the ends of the cables show 1000Mb connections.] Thanks again for your help. Mike At 10:26 AM 12/5/2007, Joshua Baker-LePain wrote: >On Wed, 5 Dec 2007 at 9:55am, Michael H. Frese wrote > >>Clearly, this puts a premium on using tcp for nfs. All our >>attempts to do that failed. Well, both of them, anyway. In the >>first one, we unmounted the offending disk, modified its fstab >>entry, and remounted it. We were pretty careful in the second one, >>where we added tcp to the fstab argument, unmounted all the remote >>disks, restarted all the nfsd's, and did 'mount -a'. We got an >>error message in both cases that didn't obviously refer to the tcp >>argument, but the mount didn't happen. As I write this, I see >>references to tcp mount requests in the mountd man page, so maybe >>we need to do a bit more here. >> >>The Wikipedia article on nfs says this: "At the time of >>introduction of Version 3, vendor support for TCP as a >>transport-layer protocol began increasing. While several vendors >>had already added support for NFS Version 2 with TCP as a >>transport, Sun Microsystems added support for TCP as a transport >>for NFS at the same time it added support for Version 3." >> >>I'd like to know what version of nfs this server supports, but the >>man page on nfsd doesn't say. The man page on rpc.mountd says that >>it supports nfs version 2 and version 3, but that "If the NFS >>kernel module was compiled without support for NFSv3, rpc.mountd >>must be invoked with the option --no-nfs-version 3." Yet the >>/proc/procnum/cmdline for the running rpc.mountd doesn't show a >>--no-nfs-version argument. Clearly, both the kernel and the server >>need to support the use of tcp. > >Looking back through this thread, I don't see any details on the NFS >server, only the clients. What are the hardware and OS version of >the NFS server? > >Grepping through the kernel config for RH9 shows it definitely did >not support NFS over TCP *as a server*. 
If your server is newer, >though, and does support a TCP nfsd, then you may have to look at >other stuff (firewalls rules, TCP wrappers, etc) as to why the TCP >mounts didn't work. > >-- >Joshua Baker-LePain >QB3 Shared Cluster Sysadmin >UCSF >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From mg.mailing-list at laposte.net Wed Dec 5 00:15:17 2007 From: mg.mailing-list at laposte.net (Mathieu Gontier) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] use a MPI library thought a shared library In-Reply-To: References: <4755A486.5000109@laposte.net> Message-ID: <47565E15.8090502@laposte.net> Sorry. Indeed, the included should not be here: it is a relic of some flags added to understand the problem. Then, the test case is correct without this include. So, Peter, you well understand morphmpi.h ;-) Mathieu Gontier Core Development Engineer Read the attached v-card for telephone, fax, adress Look at our web-site http://www.fft.be Peter St. John wrote: > Mathieu, > I didn't spot why you included ? It seems you work thru > morph_mpi.h wrappers, right? Perhaps I misunderstand? > Peter > > On Dec 4, 2007 2:03 PM, Mathieu Gontier > wrote: > > Hi all, > > I am currently working with a project named MorphMPI. Its main purpose > is to offer a generic interface for the developers of parallel > applications, and chose the MPI library/interconnect at the runtime by > rebuilding a shared morph library against the desire MPI library. (The > final application is linked against a shared morph library instead of > the real MPI library.) > For more information about that, you can follow these links: > - http://www.clustermonkey.net//content/view/213/32/ > > - http://sourceforge.net/projects/morphmpi > > So, I meet a little problem whatever the MPI library used (I tried > with > MPICH-1.2.5.2, MPICHGM and IntelMPI). 
> When MorphMPI is linked statically with my parallel application, > everything is ok; but when MorphMPI is linked dynamically with my > parallel application, MPI_Get_count return a wrong value. > > I concluded it is difficult to use a MPI library thought a shared > library. I wonder if someone have more information about it (in this > case, you're welcome ;-) ) > > Thank you for your support, > Mathieu. > > PS: my problem happens in the the following example, > > # include > > # include > > #include > > > int main( int argc, char* argv[] ) > > { > > int np, me, ier, flag=0, msglen=-1 ; > > MorphMPI_Request request ; > > MorphMPI_Status status ; > > int buf[1] ; buf[0]=-1 ; > > > ier = MorphMPI_Init( &argc, &argv ) ; > > ier = MorphMPI_Comm_size( MorphMPI_COMM_WORLD, &np ) ; > > ier = MorphMPI_Comm_rank( MorphMPI_COMM_WORLD, &me ) ; > > > if( me > 1 ) printf( "I am the useless processor #%d on %d\n", > me, np ) ; > > else printf( "I am the working processor #%d on %d\n", me, np ) ; > > > ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ; > > > printf( "<<< %d >>>\n", &status ) ; > > > if( ! 
me ) { > > buf[0] = 69 ; > > ier = MorphMPI_Isend( buf, 1, MorphMPI_INT, 1,1, > MorphMPI_COMM_WORLD, &request ) ; > > ier = MorphMPI_Wait( &request, &status ) ; > > } > > > ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ; > > > if( me == 1 ) { > > ier = MorphMPI_Irecv( buf, 1, MorphMPI_INT, 0, 1, > MorphMPI_COMM_WORLD, &request ) ; > > ier = MorphMPI_Wait( &request, &status ) ; > > ier = MorphMPI_Get_count( &status, MorphMPI_INT, &msglen ) ; > > > if( msglen != 1 ) printf( "ERROR: The lengh of the message is > not 1\n" ) ; > > else printf( "SUCCESS !\n" ) ; > > } > > > ier = MorphMPI_Finalize() ; > > } > > > > -- > Mathieu Gontier > Core Development Engineer > > Read the attached v-card for telephone, fax, adress > Look at our web-site http://www.fft.be

From mg.mailing-list at laposte.net Wed Dec 5 00:28:05 2007
From: mg.mailing-list at laposte.net (Mathieu Gontier)
Date: Sat Oct 11 01:06:45 2008
Subject: [Beowulf] use a MPI library thought a shared library
In-Reply-To: <4755CB51.5050802@scalableinformatics.com>
References: <4755A486.5000109@laposte.net> <4755CB51.5050802@scalableinformatics.com>
Message-ID: <47566115.5000009@laposte.net>

Yep, I use ldd every day. But here the problem comes from a corrupted structure shared between MorphMPI and MPI. In MorphMPI:

    typedef struct{
        int MorphMPI_SOURCE;
        int MorphMPI_TAG;
        int MorphMPI_ERROR;
        void* mpi_status ;
    } MorphMPI_Status ;

where the member mpi_status is used to point to a real MPI_Status. In MPICH:

    typedef struct{
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
        int count ;
    } MPI_Status ;

Then, when my MorphMPI_Status is given to MorphMPI_Get_count(), the pointer MorphMPI_Status::mpi_status itself is not corrupted, but the count field it points to (MorphMPI_Status::mpi_status::count) is corrupted: the value should be 4 and not "random".
I tried to manipulate the structure MorphMPI_Status (add another integer to align it in 64-bits, only have the void*,...) without success. As reminder, this problem appears only when the MPI is used through a dynamic linked MorphMPI library. Does someone have an idea? Mathieu Gontier Core Development Engineer Read the attached v-card for telephone, fax, adress Look at our web-site http://www.fft.be Joe Landman wrote: > Greetings Mathieu: > > Mathieu Gontier wrote: > > [...] > >> So, I meet a little problem whatever the MPI library used (I tried >> with MPICH-1.2.5.2, MPICHGM and IntelMPI). >> When MorphMPI is linked statically with my parallel application, >> everything is ok; but when MorphMPI is linked dynamically with my >> parallel application, MPI_Get_count return a wrong value. >> >> I concluded it is difficult to use a MPI library thought a shared >> library. I wonder if someone have more information about it (in this > > Not likely. I would suggest ldd. It is your friend. > > For example: > > joe@pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe > libm.so.6 => /lib/libm.so.6 (0x00002b5409d17000) > libmpi.so.0 => not found > libopen-rte.so.0 => not found > libopen-pal.so.0 => not found > librt.so.1 => /lib/librt.so.1 (0x00002b5409f99000) > libdl.so.2 => /lib/libdl.so.2 (0x00002b540a1a2000) > libnsl.so.1 => /lib/libnsl.so.1 (0x00002b540a3a6000) > libutil.so.1 => /lib/libutil.so.1 (0x00002b540a5c0000) > libpthread.so.0 => /lib/libpthread.so.0 (0x00002b540a7c3000) > libc.so.6 => /lib/libc.so.6 (0x00002b540a9de000) > /lib64/ld-linux-x86-64.so.2 (0x00002b5409af9000) > > Notice that libmpi.so.0 is not found, so I can't run this by hand. 
> Unless I force the issue using LD_LIBRARY_PATH > > joe@pegasus-i:~/workspace/source-mpi$ export > LD_LIBRARY_PATH="/home/joe/local/lib64/:/home/joe/local/lib/" > joe@pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe > libm.so.6 => /lib/libm.so.6 (0x00002ae35ca50000) > libmpi.so.0 => /home/joe/local/lib/libmpi.so.0 > (0x00002ae35ccd1000) > libopen-rte.so.0 => /home/joe/local/lib/libopen-rte.so.0 > (0x00002ae35cfe8000) > libopen-pal.so.0 => /home/joe/local/lib/libopen-pal.so.0 > (0x00002ae35d2b3000) > librt.so.1 => /lib/librt.so.1 (0x00002ae35d514000) > libdl.so.2 => /lib/libdl.so.2 (0x00002ae35d71d000) > libnsl.so.1 => /lib/libnsl.so.1 (0x00002ae35d921000) > libutil.so.1 => /lib/libutil.so.1 (0x00002ae35db3b000) > libpthread.so.0 => /lib/libpthread.so.0 (0x00002ae35dd3e000) > libc.so.6 => /lib/libc.so.6 (0x00002ae35df59000) > /lib64/ld-linux-x86-64.so.2 (0x00002ae35c832000) > > and it might even run ... > > joe@pegasus-i:~/workspace/source-mpi$ ./matmul_mpi_3.exe > D[tid=0]: running on machine = pegasus-i > D: checking arguments: N_args=1 > D: arg[0] = ./matmul_mpi_3.exe > Allocating memory ... > array size in MB = 7.629 MB > (remember, you have 2 of these)normalization a: 0.05510, b: 0.00173 > 0 : loop_min = 0, loop_max = 1000 > ... > > Do you have some sort of LD_LIBRARY_PATH set up? Or something set in > /etc/ld.so.config that points to where these things are? Remember, > mpirun/mpiexec's alternative purpose in life is to set up the correct > run time environment for you, so you might want to see what is going > on with the environment in your equivalent command. > > From gdjacobs at gmail.com Wed Dec 5 13:09:33 2007 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] BIOS In-Reply-To: References: <20070809141520.GA605@gretchen.aei.uni-hannover.de> <46BEA3B7.4010806@aei.mpg.de> Message-ID: <4757138D.6070901@gmail.com> Bruno Coutinho wrote: > > > 2007/8/12, Robert G. 
Brown >:
>
> On Sun, 12 Aug 2007, Carsten Aulbert wrote:
>
> > Thanks for the link. In principle we have everything working already
> > that way, but want to "excel" a bit more:
>
> No, no, no. You want to "ooffice" a little more...;-)
>
> >
> > (1) Right now we use memdisk from the syslinux/isolinux family to boot
> > the dos image. Booting an exact floppy image works fine, but for some
> > part in (2) we might need more space than a 2,88 MB floppy or its
> > extended pendant gives to us. Thus we are currently trying to boot a hd
> > image which seems to be a bit trickier than a simple floppy image
> > (getting boot code, partition table right for example).
> >
> > (2) We want to have some feedback from the process and don't want to
> > have an automatic reboot after a possible failure because in the worst
> > case this might "brickify" a node. Once I had the problem, that
> > automatic BIOS flashing worked, but one node - which looked similar but
> > behaved differently - was not able to finish the flashing procedure
> > successfully. Since I was monitoring the node I was able to redo the
> > flashing with a different option [1].
> >
> > Anyway, that's the reason why we want to include a dhcp client and some
> > means, possibly a ssh or rsh client along with the needed packetdriver
> > to the image and notify the server that way, that it successfully
> > flashed the BIOS and set our custom settings correctly. Only after that
> > the nodes should continue FAIing.
>
> No, that's reasonable -- I just didn't understand. Autoexec.bat is dumb
> as a post in comparison even with /bin/sh, too.
>
>
> It's dumb, but not so dumb. :-)
> The syntax is crappy but it has this feature:
> http://www.robvanderwoude.com/errorlevel.html
>
> OBS: REM is a comment initiator like #.

DOS batch files were actually surprisingly capable. It's just that what they could do was not as well documented as, for example, Bash is today.

-- Geoffrey D.
Jacobs From Bogdan.Costescu at iwr.uni-heidelberg.de Thu Dec 6 09:37:11 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Re: NFS Read Errors In-Reply-To: <4755C518.5070409@scalableinformatics.com> References: <4755C518.5070409@scalableinformatics.com> Message-ID: On Tue, 4 Dec 2007, Joe Landman wrote: > a) bad driver > b) bad NIC ... or a combination of these which translates into RX and/or TX checksumming offload not working properly; the driver then lies to the upper levels and the error is passed through. I don't remember if this was even possible at the time of RHL9, but try to run: ethtool -k ethX and if any of the checksums are turned on, you can turn them off with: ethtool -K ethX rx off ethtool -K ethX tx off (note: ethtool might not be installed by default, check the install media if there was a package with this name) -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From landman at scalableinformatics.com Thu Dec 6 14:18:12 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Re: NFS Read Errors In-Reply-To: References: <4755C518.5070409@scalableinformatics.com> Message-ID: <47587524.3070107@scalableinformatics.com> Bogdan Costescu wrote: > On Tue, 4 Dec 2007, Joe Landman wrote: > >> a) bad driver >> b) bad NIC > > ... or a combination of these which translates into RX and/or TX > checksumming offload not working properly; the driver then lies to the > upper levels and the error is passed through. I don't remember if this > was even possible at the time of RHL9, but try to run: I think ethtool was a post 2.4 kernel utility. As I remember, there was an miitool that gave something roughly like that in functionality. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From Bogdan.Costescu at iwr.uni-heidelberg.de Thu Dec 6 14:44:14 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Sat Oct 11 01:06:45 2008 Subject: [Beowulf] Re: NFS Read Errors In-Reply-To: <47587524.3070107@scalableinformatics.com> References: <4755C518.5070409@scalableinformatics.com> <47587524.3070107@scalableinformatics.com> Message-ID: On Thu, 6 Dec 2007, Joe Landman wrote: > I think ethtool was a post 2.4 kernel utility. As I remember, there was an > miitool that gave something roughly like that in functionality. I just looked in the pristine 2.4.20 source and found 8139cp with references to the 8169 chip used on the Netgear GA311 and a routine called "cp_ethtool_ioctl" with switch statements for RX and TX checksumming... whether ethtool w