From Hakon.Bugge at scali.com Fri Feb 1 04:52:15 2008 From: Hakon.Bugge at scali.com (Håkon Bugge) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] Cheap SDR IB In-Reply-To: References: <200801302001.m0UK0UCS015867@bluewest.scyld.com> <20080131095052.EB94635B03D@mail.scali.no> Message-ID: <20080201125216.100F235B13F@mail.scali.no> Mark, At 15:09 31.01.2008, Mark Hahn wrote: >I did not claim the opposite - I said that for small, cost-sensitive >clusters, it would be unusual to need IB's advantages (high bandwidth >and latency comparable to other non-Gb interconnects.) > >in particular, I'm curious about the conventional wisdom about weather codes >and bandwidth. >I was curious about this: you only used one DDR port; was that because >of lack of switch ports, or because WRF uses bandwidth <= DDR? The system is a general purpose benchmarking system; not particularly crafted for running WRF. Based on a slightly apples-to-oranges comparison, you will see that QLogic's SPEC MPI2007 submission contains a WRF number (374s) which is _very_ similar to what I reported. This is an indication that WRF on this system / dataset is not restricted by SDR bandwidth (also, for the record, this is a slight mix of compilers, Pathscale 3.0 and Intel 9.1, - but they both do a decent job on WRF). >sure, and these are very fat nodes for which a fat interconnect is >appropriate for almost any workload that's not embarrassing. but really >I wasn't suggesting that plain old Gb (bandwidth in particular) was >adequate for all possible clusters. I was questioning whether IB >was a panacea for small, cost-sensitive ones... I do not agree that dual-socket, dual-core Woodcrest nodes these days are "very fat". A quad-socket, quad-core is. A quad-socket, dual-core or a dual-socket, quad-core might be considered semi-fat... 
Hakon From vanallsburg at hope.edu Fri Feb 1 09:06:02 2008 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] weather modeling cluster Message-ID: <47A3517A.5050100@hope.edu> All, I'm interested in setting up an open source weather modeling cluster in an educational environment. My existing clusters run chemistry, math and bio applications and I don't know what weather app would be a good choice for a first time effort. Thanks for any input that may help me get my feet wet... Paul -- Paul Van Allsburg Computational Science & Modeling Facilitator Natural Sciences Division, Hope College 35 East 12th Street Holland, Michigan 49423 616-395-7292 http://www.hope.edu/academic/csm/ From john.leidel at gmail.com Fri Feb 1 09:29:24 2008 From: john.leidel at gmail.com (John Leidel) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] weather modeling cluster In-Reply-To: <47A3517A.5050100@hope.edu> References: <47A3517A.5050100@hope.edu> Message-ID: <1201886964.17107.9.camel@e521.site> Check out WRF: Weather Research and Forecasting Model http://www.wrf-model.org/index.php On Fri, 2008-02-01 at 12:06 -0500, Paul Van Allsburg wrote: > All, > I'm interested in setting up an open source weather modeling cluster in > an educational environment. My existing clusters run chemistry, math and > bio applications and I don't know what weather app would be a good > choice for a first time effort. Thanks for any input that may help me > get my feet wet... > > Paul > > From gerry.creager at tamu.edu Fri Feb 1 10:04:25 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] weather modeling cluster In-Reply-To: <1201886964.17107.9.camel@e521.site> References: <47A3517A.5050100@hope.edu> <1201886964.17107.9.camel@e521.site> Message-ID: <47A35F29.7080609@tamu.edu> I'll second WRF for a good starting point for weather codes. Feel free to drop me a line if you need some suggestions with it. 
Also: Plan to send someone to the summer WRF tutorial in Boulder, where they'll get good info to bring things up right. gerry John Leidel wrote: > Check out WRF: Weather Research and Forecasting Model > > http://www.wrf-model.org/index.php > > > > On Fri, 2008-02-01 at 12:06 -0500, Paul Van Allsburg wrote: >> All, >> I'm interested in setting up a open source weather modeling cluster in >> an educational environment. My existing clusters run chemistry, math and >> bio applications and I don't know what weather app would be a good >> choice for a first time effort. Thanks for any input that may help me >> get my feet wet... >> >> Paul >> >> > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From ionsourcerer at mac.com Fri Feb 1 08:45:35 2008 From: ionsourcerer at mac.com (Rick Becker) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] unsubscribe Message-ID: <4FE37EFA-17EF-4B3E-BACB-96436E358A62@mac.com> Rick Becker Cluster Sciences Borolene Metamaterials 39 Topsfield Rd. Ipswich, MA 01938 US 978-337-9009 ionsourcerer@mac.com If you do not know where you are going, call it "exploration". If you do not know what you are doing, call it "research". -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20080201/379da985/attachment.html From landman at scalableinformatics.com Fri Feb 1 11:33:12 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] Cheap SDR IB In-Reply-To: <47A228F2.1070309@physics.isu.edu> References: <9FA59C95FFCBB34EA5E42C1A8573784FE5A683@mtiexch01.mti.com> <47A228F2.1070309@physics.isu.edu> Message-ID: <47A373F8.9050303@scalableinformatics.com> Brian Oborn wrote: > A quick side question. Is it possible to use IB as a cross-over with no > switch? If I had just 2 fat nodes could I connect the HCAs directly to Yes. > each other and avoid the switch costs? Could this be extended to ring or > hypercube topologies? Yeah ... but a switch rapidly makes sense. One link going down would "quench" your ring ala FDDI. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From Daniel.Pfenniger at obs.unige.ch Fri Feb 1 11:36:47 2008 From: Daniel.Pfenniger at obs.unige.ch (Daniel Pfenniger) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] Cheap SDR IB In-Reply-To: <47A228F2.1070309@physics.isu.edu> References: <9FA59C95FFCBB34EA5E42C1A8573784FE5A683@mtiexch01.mti.com> <47A228F2.1070309@physics.isu.edu> Message-ID: <47A374CF.9030901@obs.unige.ch> Brian Oborn wrote: > .... > A quick side question. Is it possible to use IB as a cross-over with no > switch? Yes, and the cables are the same. > If I had just 2 fat nodes could I connect the HCAs directly to > each other and avoid the switch costs? Yes. With 3 nodes it might be cheaper having 2 HCA per node, 6 cables, than the switch solution. 
Dan From hahn at mcmaster.ca Fri Feb 1 12:10:38 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] Cheap SDR IB In-Reply-To: <47A374CF.9030901@obs.unige.ch> References: <9FA59C95FFCBB34EA5E42C1A8573784FE5A683@mtiexch01.mti.com> <47A228F2.1070309@physics.isu.edu> <47A374CF.9030901@obs.unige.ch> Message-ID: >> If I had just 2 fat nodes could I connect the HCAs directly to each other >> and avoid the switch costs? > > Yes. With 3 nodes it might be cheaper having 2 HCA per node, 6 cables, > than the switch solution. with 3 nodes, each with a two ports, wouldn't you need just 3 cables? how is routing controlled in switchless configs? does IB have node-level forwarding? From daniel.pfenniger at obs.unige.ch Fri Feb 1 14:56:35 2008 From: daniel.pfenniger at obs.unige.ch (Pfenniger Daniel) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] Cheap SDR IB In-Reply-To: References: <9FA59C95FFCBB34EA5E42C1A8573784FE5A683@mtiexch01.mti.com> <47A228F2.1070309@physics.isu.edu> <47A374CF.9030901@obs.unige.ch> Message-ID: <47A3A3A3.8070203@obs.unige.ch> Mark Hahn wrote: >>> If I had just 2 fat nodes could I connect the HCAs directly to each >>> other and avoid the switch costs? >> >> Yes. With 3 nodes it might be cheaper having 2 HCA per node, 6 cables, >> than the switch solution. > > with 3 nodes, each with a two ports, > wouldn't you need just 3 cables? > Yes, I forgot to divide by 2! Some HCA have 2 ports, so they would be indicated for a 3 node switchless cluster. Dan From kilian at stanford.edu Fri Feb 1 17:44:54 2008 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] Cheap SDR IB In-Reply-To: References: <9FA59C95FFCBB34EA5E42C1A8573784FE5A683@mtiexch01.mti.com> <47A374CF.9030901@obs.unige.ch> Message-ID: <200802011744.55007.kilian@stanford.edu> Hi Mark, On Friday 01 February 2008 12:10:38 pm Mark Hahn wrote: > how is routing controlled in switchless configs? It's not. 
:) > does IB have node-level forwarding? No, you can't forward traffic between non-directly connected nodes in such a ring topology (without any switch). You would need intra-node routing mechanisms which are not present in OFED. I don't know about other implementations, though. Besides, for each cross-over pair, you'll be creating a separate subnet, and each subnet requires its own subnet manager. However, in a 3-node ring, each node can directly connect to all the other ones, and strictly speaking, you only have 2 subnets. So I guess node-level forwarding is not an issue, and that's probably a viable solution. Cheers, -- Kilian From 3lucid at gmail.com Sun Feb 3 10:35:02 2008 From: 3lucid at gmail.com (Kyle Spaans) Date: Thu Aug 28 01:06:48 2008 Subject: [Beowulf] TIPC in a Beowulf? Message-ID: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com> Has anyone heard of or seen TIPC used in a Beowulf Cluster? Some folks from Wind River (creators of the protocol I think) came and gave a talk about it at my school. They said it can be used over IP, or even on its own through ethernet, and would even work with myrinet or infiniband with proper drivers. I'm still not very familiar with programming a Beowulf, but Inter Process Communication is an equally viable paradigm just like Message Passing, right? From hahn at mcmaster.ca Sun Feb 3 14:42:56 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] TIPC in a Beowulf? In-Reply-To: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com> References: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com> Message-ID: > Has anyone heard of or seen TIPC used in a > Beowulf Cluster? I haven't. I sat in on tipc meetings at OLS a few times, and have the impression that TIPC people are much more into telecom/footprint issues rather than HPC. 
(and yes, I believe these are very different focuses - for HPC, the main issue is latency (since bandwidth is not that hard.)) I _think_ I'm not confusing TIPC with SCTP (which also seems to be rather telecom-oriented.) here are some kind of shocking performance measures: http://www.strlen.de/tipc/ no mention of latency there. > Some folks from Wind River (creators of the protocol I think) came and > gave a talk about it at my school. They said it can be used over IP, > or even on it's own through ethernet, and would even work with myrinet > or infiniband with proper drivers. well, TIPC is trying to do a lot that TCP isn't. for instance, I think it's trying to do fairly full group membership as well as topology-aware routing. I'm not sure these are as critical to HPC-type clustering as they would be for HA-type clustering. I'm also a bit skeptical of a protocol that aims to put everything into one kernel-resident layer... > I'm still not very familiar with programming a Beowulf, but Inter > Process Communication is an equally viable paradigm just like Message > Passing, right? TIPC is a form of MP. don't confuse MP with MPI! MPI is important and widespread, but I don't think many people would say that it's perfect. MPI-over TCP in particular is kind of a shame, since TCP is really a protocol designed for flakey, overloaded, heterogenous WANs, not the kind of dedicated, homogenous, flat network you find in an HPC cluster. I'm looking forward to OpenMX - it's a message-passing layer amenable to ethernet, but well-suited for MPI. any OpenMX people care to comment? regards, mark hahn. From steve_heaton at exemail.com.au Fri Feb 1 13:56:48 2008 From: steve_heaton at exemail.com.au (Particle Boy) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: weather modeling cluster In-Reply-To: <200802011935.m11JYqvd026938@bluewest.scyld.com> References: <200802011935.m11JYqvd026938@bluewest.scyld.com> Message-ID: <47A395A0.3080809@exemail.com.au> I'll also recommend the WRF. 
The WRF EMS kit from STRC at UCAR is a Good Thing: http://strc.comet.ucar.edu/wrf/index.htm I got it going quickly and relatively easily without any prior experience of large models. Bob R has an entertaining sense of humour (you'll see from the scripts) and was also kind enough to quickly send me a starter kit all the way down here to Oz. Great service :) If your students are looking to tinker with code, I had a lot of fun with PUMA: http://puma.dkrz.de/puma Cheers Stevo > Date: Fri, 01 Feb 2008 12:06:02 -0500 > From: Paul Van Allsburg > Subject: [Beowulf] weather modeling cluster > To: Beowulf Mailing list > Message-ID: <47A3517A.5050100@hope.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > All, > I'm interested in setting up a open source weather modeling cluster in > an educational environment. My existing clusters run chemistry, math and > bio applications and I don't know what weather app would be a good > choice for a first time effort. Thanks for any input that may help me > get my feet wet... > > Paul > > From wrankin at ee.duke.edu Mon Feb 4 10:35:00 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] TIPC in a Beowulf? In-Reply-To: References: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com> Message-ID: Hey Mark, > I'm looking forward to OpenMX - it's a message-passing layer > amenable to ethernet, but well-suited for MPI. any OpenMX people > care to comment? Do you have any links to the current status of this effort? All my Googling leads to links on a package (also called OpenMX) for nano- material simulations. Thanks for the info. -bill From hahn at mcmaster.ca Mon Feb 4 11:09:59 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] TIPC in a Beowulf? 
In-Reply-To: References: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com> Message-ID: >> I'm looking forward to OpenMX - it's a message-passing layer amenable to >> ethernet, but well-suited for MPI. any OpenMX people care to comment? > > Do you have any links to the current status of this effort? All my Googling > leads to links on a package (also called OpenMX) for nano-material > simulations. unfortunately no. all I've had is cruel teasing messages from myricom-related people. "code-tease" ;) to me, it seems like this would be a fairly high priority for myricom, since it emphasizes the value of ethernet interop, whether 1Gb or 10Gb. From Brice.Goglin at inria.fr Mon Feb 4 11:17:23 2008 From: Brice.Goglin at inria.fr (Brice Goglin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] TIPC in a Beowulf? In-Reply-To: References: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com> Message-ID: <47A764C3.30301@inria.fr> Mark Hahn wrote: > I'm looking forward to OpenMX - it's a message-passing layer amenable > to ethernet, but well-suited for MPI. any OpenMX people care to comment? Hi, http://open-mx.org gives a pretty good summary of the current status of Open-MX. The stack is plugged on top of the Ethernet layer in the Linux kernel to send/receive MX messages. The MX firmware is basically emulated in a kernel module without requiring any specific feature in the hardware. Release 0.3 is young but I am confident that it's not too bad. The performance still needs improvement but the stack is already reasonably stable. At least MPICH-MX and Open MPI build on top of it and complete IMB. I encourage people to test it and send some feedback. If you need more information, feel free to ask. Brice From werstiuk at platform.com Mon Feb 4 11:24:25 2008 From: werstiuk at platform.com (Nick Werstiuk) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] TIPC in a Beowulf? 
Message-ID: <531893A968B34D40B36C7A6445BC828A0127177D@catoexm06.noam.corp.platform.com> I came across this site that has some information on the project including access to the current version of the code, and a paper describing the approach. http://open-mx.gforge.inria.fr/ Regards, Nick -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Mark Hahn Sent: Monday, February 04, 2008 2:10 PM To: Bill Rankin Cc: Beowulf List Subject: Re: [Beowulf] TIPC in a Beowulf? >> I'm looking forward to OpenMX - it's a message-passing layer amenable >> to ethernet, but well-suited for MPI. any OpenMX people care to comment? > > Do you have any links to the current status of this effort? All my > Googling leads to links on a package (also called OpenMX) for > nano-material simulations. unfortunately no. all I've had is cruel teasing messages from myricom-related people. "code-tease" ;) to me, it seems like this would be a fairly high priority for myricom, since it emphasizes the value of ethernet interop, whether 1Gb or 10Gb. _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ebiederm at xmission.com Tue Feb 5 05:59:15 2008 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: Cheap SDR IB In-Reply-To: (David Mathog's message of "Wed, 30 Jan 2008 13:05:13 -0800") References: Message-ID: "David Mathog" writes: > Joe Landman wrote: >> Gilad Shainer wrote: >> >> >> IB for gaming? I have one ratio: 1e-1/3e-6. that's human >> >> reaction time versus IB latency. >> >> >> > >> > Oh yes... I guess you did not play for a long time. Did you? Talk >> > with someone who suffer from lagging and you will get the story, even >> > When he has a great video card. 
It's the network and the CPU overhead >> > that are the cause of this issue >> >> Er... ah ... yeah. Milliseconds is typical in FPS games. hundreds of >> ms are bad. Hundreds of microseconds aren't ... ok, depends upon your >> FPS, I am sure the military folks have *really* fun ones which require >> that sort of latency. > > Many FPS games are still keyboard driven, and the scan rate on the > keyboard is likely only on the order of 10Hz. Gaming mice scan position > a lot faster though, last I looked they were closing in on 10000 data > points per second. Even so, human reaction time is now, and probably > will be forever, at the .1 second level, so even if that gaming mouse > could record 1000 button presses a second, no gamer is ever going to be > able to push that button at anywhere near that rate. > > IB would be massive overkill for gaming, 100 (or even 10) baseT should > work just fine unless the network is hideously congested, in which case > the game is probably going to become unplayable due to dropped UDP packets. Spin it the other way. Scale your online gaming server cluster using IB, and you probably have something. Eric From rgb at phy.duke.edu Tue Feb 5 06:53:10 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: Cheap SDR IB In-Reply-To: References: Message-ID: On Tue, 5 Feb 2008, Eric W. Biederman wrote: > Spin it the other way. Scale your online gaming server cluster using IB, > and you probably have something. And they may well do this. There are a lot of problems in provisioning online MMRPGs with "Universes" that are shared with HPC clusters and with HA clusters. Most of the sane ones spin off the actual rendering onto the clients, but they are still responsible for managing a huge inventory of objects as well as all the NPCs, in realtime interaction with PCs, in a large distributed "space". In some cases e.g. 
WoW the space has some fairly obvious boundaries -- different continents are plausibly on different servers in a realm cluster, ditto instances, where there are clear "cuts" when your character is "moved" from one server to another. They may even partition continents, but to do that (and manage a smooth passage across "country" boundaries) they need bottlenecks to limit traffic and a region of real-time overlap where characters are maintained (as it were) on both servers. Here IB or gigE would be very useful. It also might let them increase the fineness or granularity of boundaries, increase the server capacity for handling large numbers of simultaneous gamers by adding more physical servers to handle the large numbers of players that can occur in any given continent or country, and so on. Actually, MMRPGs are fun both to play and to think about as a cluster problem. But the big companies tend to be a bit chary of revealing their technology, although I have read a few articles on the subject. It is likely that many of the details of their implementations remain hidden. rgb > > Eric > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From eugen at leitl.org Tue Feb 5 08:13:28 2008 From: eugen at leitl.org (Eugen Leitl) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: Cheap SDR IB In-Reply-To: References: Message-ID: <20080205161328.GJ10128@leitl.org> On Tue, Feb 05, 2008 at 09:53:10AM -0500, Robert G. Brown wrote: > And they may well do this. 
There are a lot of problems in provisioning > online MMRPGs with "Universes" that are shared with HPC clusters and > with HA clusters. Most of the sane ones spin off the actual rendering > onto the clients, but they are still responsible for managing a huge > inventory of objects as well as all the NPCs, in realtime interaction > with PCs, in a large distributed "space". In some cases e.g. WoW the The Second Life does the physics server-side. With the given technology, a region (one virtual server) will become sluggish (and soon hereafter crash) after some 60-70 avatars frolic in the area. There's definitely potential for better interconnects and game clusters (deja vu, we must have discussed this some 5-8 years ago). > space has some fairly obvious boundaries -- different continents are > plausibly on different servers in a realm cluster, ditto instances, SL islands are rectangular boxes (the client used to crash spectacularly when altitude exceeded a signed short int). The world tessellates trivially on a 2d or 3d grid/torus. > where there are clear "cuts" when your character is "moved" from one > server to another. They may even partition continents, but to do that > (and manage a smooth passage across "country" boundaries) they need > bottlenecks to limit traffic and a region of real-time overlap where > characters are maintained (as it were) on both servers. Here IB or gigE > would be very useful. It also might let them increase the fineness or > granularity of boundaries, increase the server capacity for handling > large numbers of simultaneous gamers by adding more physical servers > to handle the large numbers of players that can occur in any given > continent or country, and so on. To start with, writing distributed game servers with MPI would be a nice touch. I'm not aware of any effort which does it. > Actually, MMRPGs are fun both to play and to think about as a cluster > problem. 
But the big companies tend to be a bit chary of revealing > their technology, although I have read a few articles on the subject. > It is likely that many of the details of their implementations remain > hidden. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From rgb at phy.duke.edu Tue Feb 5 08:40:24 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: Cheap SDR IB In-Reply-To: <20080205161328.GJ10128@leitl.org> References: <20080205161328.GJ10128@leitl.org> Message-ID: On Tue, 5 Feb 2008, Eugen Leitl wrote: > On Tue, Feb 05, 2008 at 09:53:10AM -0500, Robert G. Brown wrote: > >> And they may well do this. There are a lot of problems in provisioning >> online MMRPGs with "Universes" that are shared with HPC clusters and >> with HA clusters. Most of the sane ones spin off the actual rendering >> onto the clients, but they are still responsible for managing a huge >> inventory of objects as well as all the NPCs, in realtime interaction >> with PCs, in a large distributed "space". In some cases e.g. WoW the > > The Second Life does the physics server-side. With the given technology, > a region (one virtual server) will become sluggish (and soon herafter > crash) after some 60-70 avatars frolick in the area. > > There's definitely potential for better interconnects and game > clusters (deja vu, we must have discussed this some 5-8 years ago). Yeah, and my experiences with 2ndL are highly negatory as a consequence. It is a bad cluster design. It does not scale. >> space has some fairly obvious boundaries -- different continents are >> plausibly on different servers in a realm cluster, ditto instances, > > SL islands are rectangular boxes (the client used to crash spectacularly > when altitude exceeded a signed short int). 
The world tesselates trivially > on a 2d or 3rd grid/torus. SL needs to adopt some of the technologies of other MMRPGs -- the ones that work. It makes the result more complex on the client side -- one has to update WoW every six months or so with new textures, maps, and display side bugfixes -- but it scales much, much better on the server side and is much less bottlenecked at the client side network (which may be "only" DSL). SL is a resource hog all around. I note that people are most impressed with it if they've never hung out in one of the well-designed, scalable worlds of WoW. Any player of WoW would laugh and cry to see how primitive, slow, clumsy it is. It has a good idea -- the ability of users to create objects and add them to the environment -- but it needs a much better algorithm for managing the construction process and an object-oriented look-ahead synchronization process that reduces the bottlenecks to something endurable. rgb -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From wrankin at ee.duke.edu Tue Feb 5 09:21:53 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: Cheap SDR IB References: Message-ID: <4295A7EF-1D80-437D-AF11-4AF57958437D@ee.duke.edu> >> >> There's definitely potential for better interconnects and game >> clusters (deja vu, we must have discussed this some 5-8 years ago). > > Yeah, and my experiences with 2ndL are highly negatory as a > consequence. > It is a bad cluster design. It does not scale. 
I have not puttered around in SL for a while, but IIRC one of the "problems" is that SL allows the user to create their own fairly complex physical models and devices which is computationally restrictive when modeled on the server side and also bandwidth restricted when pushing the models out to the client. WoW, OTOH heavily restricts user customization which saves both server cycles as well as bandwidth. This does heavily restrict the user experience (which is one of the strengths of SL) but pays back in responsiveness. > SL needs to adopt some of the technologies of other MMRPGs -- the ones > that work. It makes the result more complex on the client side -- one > has to update WoW every six months or so with new textures, maps, and > display side bugfixes Actually, SL was going through a frequent update period for quite a while and I don't think it was any better than WoW in that respect. -b From rgb at phy.duke.edu Wed Feb 6 07:40:30 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... Message-ID: Anybody on list have any idea why PVM fails to add hosts over a wireless link? I've now tried this over multiple distro versions and at least one PVM update, and it just doesn't work. Works fine over a wire, fails on wireless, and as far as I know wire and wireless are both "identical" at the kernel interface layer so that any e.g. socket one might open is absolutely ecumenical about what the underlying hardware is (good old ISO/OSI layering, right?). And yes, I'm well aware that from a latency/bw point of view this arrangement isn't going to be a speed demon or scale terribly well, but for testing PVM from a laptop or writing code from a laptop or just playing with PVM itself for fun or profit from a laptop it would certainly be lovely if it WORKED, however poorly as far as IPCs are concerned. Yup, tried it one last time. 
Locks it right up it does, have to kill pvm[d] by hand and hand-remove the lockfiles, just like I did two or three years ago... rgb -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From peter.st.john at gmail.com Wed Feb 6 08:34:14 2008 From: peter.st.john at gmail.com (Peter St. John) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: Message-ID: RGB, Are you using 3.4.5 re "improved use on Beowulf..."? I was thinking along the lines of the script invoking PVM doing something to reboot or refresh the network, and saw "New features in PVM 3.4.x include communication contexts...". I'd be happy to read a perl or Cish thing if an extra pair of eyes might notice something, but I don't know where to start. YMHS Peter On Feb 6, 2008 10:40 AM, Robert G. Brown wrote: > Anybody on list have any idea why PVM fails to add hosts over a wireless > link? I've now tried this over multiple distro version and at least one > PVM update, and it just doesn't work. Works fine over a wire, fails on > wireless, and as far as I know wire and wireless are both "identical" > at the kernel interface layer so that any e.g. socket one might open is > absolutely ecumenical about what the underlying hardware is (good old > ISO/OSI layering, right?). > > And yes, I'm well aware that from a latency/bw point of view this > arrangement isn't going to be a speed demon or scale terribly well, but > for testing PVM from a laptop or writing code from a laptop or just > playing with PVM itself for fun or profit from a laptop it would > certainly be lovely if it WORKED, however poorly as far as IPCs are > concerned. > > Yup, tried it one last time. 
Locks it right up it does, have to kill > pvm[d] by hand and hand-remove the lockfiles, just like I did two or > three years ago... > > rgb > > -- > Robert G. Brown Phone(cell): 1-919-280-8443 > Duke University Physics Dept, Box 90305 > Durham, N.C. 27708-0305 > Web: http://www.phy.duke.edu/~rgb > Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php > Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From wrankin at ee.duke.edu Wed Feb 6 09:33:11 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: Message-ID: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> Hey Rob, Could it be a node naming issue where the wireless IP does not resolve to the same address as that used in the machinefile? I seem to recall a similar issue back when we ran PVM on machines with multiple network connections. Just a thought, -bill On Feb 6, 2008, at 10:40 AM, Robert G. Brown wrote: > Anybody on list have any idea why PVM fails to add hosts over a > wireless > link? I've now tried this over multiple distro version and at > least one > PVM update, and it just doesn't work. Works fine over a wire, > fails on > wireless, and as far as I know wire and wireless are both "identical" > at the kernel interface layer so that any e.g. socket one might > open is > absolutely ecumenical about what the underlying hardware is (good old > ISO/OSI layering, right?).
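The name-resolution check suggested above can be sketched in a few lines of Python. This is only an illustration and not anything PVM ships: resolve_ipv4 and check_host are hypothetical helper names, and the host name and address in the comments are made up. The idea is simply to confirm that the name handed to PVM resolves to the address that is actually bound to the wireless interface.

```python
import socket


def resolve_ipv4(name, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses that `name` resolves to.

    `resolver` defaults to the system resolver; it is a parameter only so
    that the lookup can be replaced by a stub (an assumption made for
    this sketch, not something PVM does).
    """
    try:
        infos = resolver(name, None, socket.AF_INET, socket.SOCK_DGRAM)
    except socket.gaierror:
        # The name does not resolve at all.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr is (address, port) for IPv4.
    return sorted({info[4][0] for info in infos})


def check_host(name, expected_ip, resolver=socket.getaddrinfo):
    """Return (matches, addresses) for `name` against `expected_ip`."""
    addrs = resolve_ipv4(name, resolver)
    return expected_ip in addrs, addrs


# Hypothetical usage on the master, with made-up values:
#   ok, addrs = check_host("laptop-wifi", "192.168.1.20")
#   print(ok, addrs)
```

Running the same check on both the master and the slave, once per interface name, shows quickly whether either side maps the wireless name to an unexpected address.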
> From reuti at staff.uni-marburg.de Wed Feb 6 09:33:47 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: Message-ID: Hi, Am 06.02.2008 um 16:40 schrieb Robert G. Brown: > Anybody on list have any idea why PVM fails to add hosts over a > wireless > link? I've now tried this over multiple distro version and at > least one > PVM update, and it just doesn't work. Works fine over a wire, > fails on > wireless, and as far as I know wire and wireless are both "identical" > at the kernel interface layer so that any e.g. socket one might > open is > absolutely ecumenical about what the underlying hardware is (good old > ISO/OSI layering, right?). > > And yes, I'm well aware that from a latency/bw point of view this > arrangement isn't going to be a speed demon or scale terribly well, > but > for testing PVM from a laptop or writing code from a laptop or just > playing with PVM itself for fun or profit from a laptop it would > certainly be lovely if it WORKED, however poorly as far as IPCs are > concerned. > > Yup, tried it one last time. Locks it right up it does, have to kill > pvm[d] by hand and hand-remove the lockfiles, just like I did two or > three years ago... is the wireless one the primary interface? Maybe a mismatch which hostname is used for which interface? Using wireless could be similar like using a secondary interface. -- Reuti > rgb > > -- > Robert G. Brown Phone(cell): 1-919-280-8443 > Duke University Physics Dept, Box 90305 > Durham, N.C. 
27708-0305 > Web: http://www.phy.duke.edu/~rgb > Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php > Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Wed Feb 6 10:21:55 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> Message-ID: On Wed, 6 Feb 2008, Bill Rankin wrote: > Hey Rob, > > Could it be a node naming issue where the wireless IP does not resolve to the > same address as that used in the machinefile? I seem to recall a similar > issue back when we PVM on machines with multiple network connections. pvmd is actually starting up on the target machine -- it works that far. The master node IP number is correct, as is the slave IP number (both visible as arguments to pvmd). The name I'm using is the one associated with the wireless interface in question, both machines ping in all four directions by name with the correct internet address. All my machines are configured more or less identically, use the same environment variables, support transparent ssh command execution (which obviously works even in PVM as the daemon is being spawned on the correct target). The wireless interfaces have the right MTU and look exactly like the ethernet devices they in fact are to the kernel AFAIK. 
In every other aspect I've ever tested, including my own homemade socket code, response to both tcp and udp daemons, ability to mount NFS, support ssh, and so on and so forth, they behave like TCP/IP sockets over ethernet devices as far as systems calls go -- they use the same interface, and the whole point of OSI/ISO is that code should not depend on the hardware layer and in general on even a roughly posix compliant machine using standard devices and e.g. the socket API it doesn't. Last time I encountered this, I actually cranked up the -d0x0 stuff and "watched" as the system went through to where it hung in the middle of doing some part of the post-spawn handshaking. I suspect a race condition, probably caused by using raw UDP with some assumption of latency during the handshake. The one way I can think of that the two connections differ is in their latency -- even the bandwidth of wireless is every bit as great as 10B2 networks I've run PVM on in years past (on proportionally slower CPUs, of course). If the master or slave send out an acknowledgement packet either before the window where the other can receive it or after it has grown bored and stopped listening, it might fail to properly bind or something. It seems like it would be a bug, not a feature, but if I were feeling infinitely masochistic and were to wander down into Other People's Source (ouch!) to try to debug this, that's what I'd look for first. Any PVM developers still on list? Any comments from them? rgb > > Just a thought, > > -bill > > > On Feb 6, 2008, at 10:40 AM, Robert G. Brown wrote: > >> Anybody on list have any idea why PVM fails to add hosts over a wireless >> link? I've now tried this over multiple distro version and at least one >> PVM update, and it just doesn't work. Works fine over a wire, fails on >> wireless, and as far as I know wire and wireless are both "identical" >> at the kernel interface layer so that any e.g. 
socket one might open is >> absolutely ecumenical about what the underlying hardware is (good old >> ISO/OSI layering, right?). > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From dnlombar at ichips.intel.com Wed Feb 6 11:06:08 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: Message-ID: <20080206190608.GA6306@nlxdcldnl2.cl.intel.com> On Wed, Feb 06, 2008 at 10:40:30AM -0500, Robert G. Brown wrote: > Anybody on list have any idea why PVM fails to add hosts over a wireless > link? I've now tried this over multiple distro version and at least one > PVM update, and it just doesn't work. Works fine over a wire, fails on > wireless, and as far as I know wire and wireless are both "identical" > at the kernel interface layer so that any e.g. socket one might open is > absolutely ecumenical about what the underlying hardware is (good old > ISO/OSI layering, right?). What is the device name? Perhaps PVM doesn't like the name? Are you running multiple devices? Does the system set its node name or is some odd name provided by DHCP? Other name resolution problems? > And yes, I'm well aware that from a latency/bw point of view this > arrangement isn't going to be a speed demon or scale terribly well, but > for testing PVM from a laptop or writing code from a laptop or just > playing with PVM itself for fun or profit from a laptop it would > certainly be lovely if it WORKED, however poorly as far as IPCs are > concerned. I've run multiple VMware instances on my *Linux* laptop back in the day when I did OSCAR development and Rocks evals. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. 
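rgb's suspicion earlier in the thread is that a UDP handshake carries a hidden timing assumption. A minimal Python sketch of what any UDP exchange has to supply for itself (a timeout and a retry) is below. It is not PVM's code and makes no claim about how pvmd actually behaves; udp_echo_server and udp_round_trip are hypothetical names, and the timeout and retry values are arbitrary assumptions.

```python
import socket
import time


def udp_echo_server(sock, count):
    """Echo `count` datagrams received on `sock` back to their senders."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def udp_round_trip(addr, payload=b"ping", timeout=1.0, retries=3):
    """Send `payload` to `addr` and wait for it to be echoed back.

    Returns the round-trip time in seconds for the first attempt that
    is answered, or None if every attempt times out. UDP gives no
    delivery guarantee, so the retry loop here is the application's
    job; with TCP the kernel would retransmit instead.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for _attempt in range(retries):
            start = time.monotonic()
            s.sendto(payload, addr)
            try:
                data, _peer = s.recvfrom(2048)
            except socket.timeout:
                # No reply inside the window: try again.
                continue
            if data == payload:
                return time.monotonic() - start
    return None
```

Pointing udp_round_trip at an echo server on the far side of the wireless link, and again over the wire, gives a rough feel for how much the two paths differ in latency and loss.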
From James.P.Lux at jpl.nasa.gov Wed Feb 6 11:26:35 2008 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: Message-ID: <6.2.3.4.2.20080206110810.02e03408@mail.jpl.nasa.gov> At 07:40 AM 2/6/2008, Robert G. Brown wrote: >Anybody on list have any idea why PVM fails to add hosts over a wireless >link? I've now tried this over multiple distro version and at least one >PVM update, and it just doesn't work. Works fine over a wire, fails on >wireless, and as far as I know wire and wireless are both "identical" >at the kernel interface layer so that any e.g. socket one might open is >absolutely ecumenical about what the underlying hardware is (good old >ISO/OSI layering, right?). > >And yes, I'm well aware that from a latency/bw point of view this >arrangement isn't going to be a speed demon or scale terribly well, but >for testing PVM from a laptop or writing code from a laptop or just >playing with PVM itself for fun or profit from a laptop it would >certainly be lovely if it WORKED, however poorly as far as IPCs are >concerned. You brave man.. trying to do what is trivial in a wired network with wireless stuff. I would look for timing assumptions that aren't met in the wireless environment. There's a channel capacity issue, of course, but there's also some constraints on round trip messages, particularly if you've got a "infrastructure" network as opposed to "ad-hoc". A packet from A to B has to go from A to Access Point( AP), which takes some back and forth handshaking and protocol overhead. Then, it gets sent from AP to B, with more back and forth. Don't expect 1 ms ping times... I spent quite a while getting NTP (which I thought would be trivial.. 
it explicitly handles long delays and intermittent connections) to work in a 802.11a network, complicated by the fact that I was using Access Points (in a "point to multipoint" configuration) as the interfaces, so the computers actually had a wired ethernet connection through a dumb 5 port switch, to the wireless AP. Getting PXE and DHCP to work was trivial by comparison Lots of weird things happen in these systems because there are hidden assumptions about timing and whether a path exists between two points. Jim From reuti at staff.uni-marburg.de Wed Feb 6 11:52:09 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> Message-ID: <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> Am 06.02.2008 um 19:21 schrieb Robert G. Brown: > On Wed, 6 Feb 2008, Bill Rankin wrote: > >> Hey Rob, >> >> Could it be a node naming issue where the wireless IP does not >> resolve to the same address as that used in the machinefile? I >> seem to recall a similar issue back when we PVM on machines with >> multiple network connections. > > pvmd is actually starting up on the target machine -- it works that > far. > The master node IP number is correct, as is the slave IP number (both > visible as arguments to pvmd). The name I'm using is the one > associated > with the wireless interface in question, both machines ping in all > four > directions by name with the correct internet address. All my machines > are configured more or less identically, use the same environment > variables, support transparent ssh command execution (which obviously > works even in PVM as the daemon is being spawned on the correct > target). > > The wireless interfaces have the right MTU and look exactly like the > ethernet devices they in fact are to the kernel AFAIK. 
In every other > aspect I've ever tested, including my own homemade socket code, > response > to both tcp and udp daemons, ability to mount NFS, support ssh, and so > on and so forth, they behave like TCP/IP sockets over ethernet devices > as far as systems calls go -- they use the same interface, and the > whole > point of OSI/ISO is that code should not depend on the hardware layer > and in general on even a roughly posix compliant machine using > standard > devices and e.g. the socket API it doesn't. > > Last time I encountered this, I actually cranked up the -d0x0 stuff > and > "watched" as the system went through to where it hung in the middle of > doing some part of the post-spawn handshaking. Just an idea to check: PVM can also be started without rsh/ssh between the machines. You have to copy and paste some things from here to there and back and can startup all daemons this way by hand (page 30 in the PVM book). Maybe this works - just to narrow the cause. -- Reuti > I suspect a race condition, probably caused by using raw UDP with some > assumption of latency during the handshake. The one way I can > think of > that the two connections differ is in their latency -- even the > bandwidth of wireless is every bit as great as 10B2 networks I've run > PVM on in years past (on proportionally slower CPUs, of course). > If the > master or slave send out an acknowledgement packet either before the > window where the other can receive it or after it has grown bored and > stopped listening, it might fail to properly bind or something. It > seems like it would be a bug, not a feature, but if I were feeling > infinitely masochistic and were to wander down into Other People's > Source (ouch!) to try to debug this, that's what I'd look for first. > > Any PVM developers still on list? Any comments from them? > > rgb > >> >> Just a thought, >> >> -bill >> >> >> On Feb 6, 2008, at 10:40 AM, Robert G. 
Brown wrote: >> >>> Anybody on list have any idea why PVM fails to add hosts over a >>> wireless >>> link? I've now tried this over multiple distro version and at >>> least one >>> PVM update, and it just doesn't work. Works fine over a wire, >>> fails on >>> wireless, and as far as I know wire and wireless are both >>> "identical" >>> at the kernel interface layer so that any e.g. socket one might >>> open is >>> absolutely ecumenical about what the underlying hardware is (good >>> old >>> ISO/OSI layering, right?). >> > > -- > Robert G. Brown Phone(cell): 1-919-280-8443 > Duke University Physics Dept, Box 90305 > Durham, N.C. 27708-0305 > Web: http://www.phy.duke.edu/~rgb > Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php > Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Wed Feb 6 13:01:22 2008 From: mathog at caltech.edu (David Mathog) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: PVM on wireless... Message-ID: > Anybody on list have any idea why PVM fails to add hosts over a wireless > link? I've now tried this over multiple distro version and at least one > PVM update, and it just doesn't work. Works fine over a wire, fails on > wireless, and as far as I know wire and wireless are both "identical" > at the kernel interface layer so that any e.g. socket one might open is > absolutely ecumenical about what the underlying hardware is (good old > ISO/OSI layering, right?). Sounds like multiple network hell, with some type of name mismatch causing the problems. Start up pvmd directly on one of the wireless machines and then use pvm to see what it calls itself. If that differs in any way from the entries in your host list then that is probably the problem. 
If they come up the same then run -d settings on pvmd to find out more information. It is also possible the firewall settings are different, and the wired interface allows pvm connections in some way that the wireless does not. Did you try starting pvmd on a pure wireless machine and see if it can connect to other pure wireless machines? It would be good to get the wired interfaces completely out of the equation. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Wed Feb 6 13:24:06 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: <6.2.3.4.2.20080206110810.02e03408@mail.jpl.nasa.gov> References: <6.2.3.4.2.20080206110810.02e03408@mail.jpl.nasa.gov> Message-ID: On Wed, 6 Feb 2008, Jim Lux wrote: > You brave man.. trying to do what is trivial in a wired network with wireless > stuff. > > I would look for timing assumptions that aren't met in the wireless > environment. There's a channel capacity issue, of course, but there's also > some constraints on round trip messages, particularly if you've got a > "infrastructure" network as opposed to "ad-hoc". A packet from A to B has to > go from A to Access Point( AP), which takes some back and forth handshaking > and protocol overhead. Then, it gets sent from AP to B, with more back and > forth. Don't expect 1 ms ping times... > > I spent quite a while getting NTP (which I thought would be trivial.. it > explicitly handles long delays and intermittent connections) to work in a > 802.11a network, complicated by the fact that I was using Access Points (in a > "point to multipoint" configuration) as the interfaces, so the computers > actually had a wired ethernet connection through a dumb 5 port switch, to the > wireless AP. 
Getting PXE and DHCP to work was trivial by comparison > > Lots of weird things happen in these systems because there are hidden > assumptions about timing and whether a path exists between two points. This is what I think that it probably is -- a race condition of some sort caused by a timing assumption, almost certainly of UDP packets as TCP should be robust. I should look at whether PVM can be built on top of just TCP these days. It used to be UDP "for efficiency" but that always means that you have to code your own reliability, packet reordering and so on into the connection, usually leaving some things OUT or you'll end up re-implementing TCP anyway, probably badly. I could be bitten by something left out that is causing certain packet sequences to arrive out of (presumed) order and have the master waiting forever for a packet that already came and was dropped. But only a look at the raw code (or setting a tcp-only flag at build time, if there is such a thing) will tell me for sure. rgb > > Jim > > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From rgb at phy.duke.edu Wed Feb 6 13:28:24 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> Message-ID: On Wed, 6 Feb 2008, Reuti wrote: > Just an idea to check: PVM can also be started without rsh/ssh between the > machines. You have to copy and paste some things from here to there and back > and can startup all daemons this way by hand (page 30 in the PVM book). Maybe > this works - just to narrow the cause. 
I'll look into this, thanks, although the daemon IS started -- the block is somewhere after that. But it is well worth trying anyway. I also wonder about ports and WAP interactions. I've got my WAP configured (AFAICT) as an internal switch, not really as a router. As in, my laptop gets DHCP service from my linux server, not the WAP, which is flat to broadcasts, has no port filtering on the internal network etc. I even ran tcpdump on the problem last time it happened -- maybe I should try it again. rgb > > -- Reuti > > >> I suspect a race condition, probably caused by using raw UDP with some >> assumption of latency during the handshake. The one way I can think of >> that the two connections differ is in their latency -- even the >> bandwidth of wireless is every bit as great as 10B2 networks I've run >> PVM on in years past (on proportionally slower CPUs, of course). If the >> master or slave send out an acknowledgement packet either before the >> window where the other can receive it or after it has grown bored and >> stopped listening, it might fail to properly bind or something. It >> seems like it would be a bug, not a feature, but if I were feeling >> infinitely masochistic and were to wander down into Other People's >> Source (ouch!) to try to debug this, that's what I'd look for first. >> >> Any PVM developers still on list? Any comments from them? >> >> rgb >> >>> >>> Just a thought, >>> >>> -bill >>> >>> >>> On Feb 6, 2008, at 10:40 AM, Robert G. Brown wrote: >>> >>>> Anybody on list have any idea why PVM fails to add hosts over a wireless >>>> link? I've now tried this over multiple distro version and at least one >>>> PVM update, and it just doesn't work. Works fine over a wire, fails on >>>> wireless, and as far as I know wire and wireless are both "identical" >>>> at the kernel interface layer so that any e.g. socket one might open is >>>> absolutely ecumenical about what the underlying hardware is (good old >>>> ISO/OSI layering, right?).
>>> >> >> -- >> Robert G. Brown Phone(cell): 1-919-280-8443 >> Duke University Physics Dept, Box 90305 >> Durham, N.C. 27708-0305 >> Web: http://www.phy.duke.edu/~rgb >> Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php >> Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From rgb at phy.duke.edu Wed Feb 6 13:42:04 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: PVM on wireless... In-Reply-To: References: Message-ID: On Wed, 6 Feb 2008, David Mathog wrote: > >> Anybody on list have any idea why PVM fails to add hosts over a wireless >> link? I've now tried this over multiple distro version and at least one >> PVM update, and it just doesn't work. Works fine over a wire, fails on >> wireless, and as far as I know wire and wireless are both "identical" >> at the kernel interface layer so that any e.g. socket one might open is >> absolutely ecumenical about what the underlying hardware is (good old >> ISO/OSI layering, right?). > > Sounds like multiple network hell, with some type of name mismatch > causing the problems. Start up pvmd directly on one of the wireless > machines and then use pvm to see what it calls itself. If that > differs in any way from the entries in your host list then that is > probably the problem. If they come up the same then run -d settings on > pvmd to find out more information. 
> > It is also possible the firewall settings are different, and the wired > interface allows pvm connections in some way that the wireless does not. > > Did you try starting pvmd on a pure wireless machine and see if it can > connect to other pure wireless machines? It would be good to get the > wired interfaces completely out of the equation. Any connection with wireless on at least one end fails. Or if you like, only wire-to-wire succeeds. And I HAVE been doing TCP/IP sysadmin for about twenty-one years now, pro-grade linux for twelve-plus. I really don't think that there is much of a chance left that there is any trivial networking error underlying this, as of course I've checked this pretty carefully, in two completely different instances with significant changes to my home network: different primary server, different WAP, different wireless cards, different laptops. As I said, the mapping between IP number and slave pvmd is exactly correct, as are all host table entries; ping works by name or IP to the same IP(s), ssh works by name or IP, http works ditto, wulfware works ditto (and shows both interfaces). NM manages wireless now while then I did it by hand, and the kernels are different (2.4 vs 2.6), yet the symptoms are exactly the same. It works to a point just half-way through the handshaking and then, AFTER the remote daemon is successfully spawned with the right lockfiles and IP numbers visible to ps with ww, it freezes until something times out, then it fails while claiming that it succeeded in adding the remote host. I can literally snap the same box onto a wire, wait for it to get an IP number on the wire, and rerun the experiment on the same hardware and it works perfectly (with a different but identically entered name, of course). And it is the wireless name that corresponds with the `hostname` (in /etc/sysconfig/network), not that this should matter (and it doesn't on the wire).
That's not to say that I can't make a mistake -- only that I've checked all the really obvious ones and EVERYTHING ELSE works perfectly and universally independent of wire vs wireless. I snap in a wire in my office, snap it off the wire and onto wireless, and back again, back and forth home to office many times per boot. After about ten days of this NM will sometimes destabilize as maybe the wireless card fails to hold state perfectly, but in the meantime every network-using tool BUT pvm just works, exactly as one expects. rgb > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From James.P.Lux at jpl.nasa.gov Wed Feb 6 14:16:13 2008 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> Message-ID: <6.2.3.4.2.20080206140052.032df040@mail.jpl.nasa.gov> At 01:28 PM 2/6/2008, Robert G. Brown wrote: >On Wed, 6 Feb 2008, Reuti wrote: > >>Just an idea to check: PVM can also be started without rsh/ssh >>between the machines. You have to copy and paste some things from >>here to there and back and can startup all daemons this way by hand >>(page 30 in the PVM book). Maybe this works - just to narrow the cause. > >I'll look into this, thanks, although the daemon IS started -- the block >it is somewhere after that.
But it is well worth trying anyway. > >I also wonder about ports and WAP interactions. I've got my WAP >configured (AFAICT) as an internal switch, not really as a router. As >in my laptop get DCHP service from my linux server, not the WAP, which >is flat to broadcasts, has no port filtering on the internal network >etc. Ahh.. but there is a "routing" process of sorts inside your AP. It has to bridge from the 802.3 wired world to the 802.11 wireless world, and that usually involves some store and forward type processing. Some of these are implemented as a store and forward router (e.g. home firewall) with one of the logical ports connected to the wireless modem. Very, very few access points (if any) are actually a dumb packet oriented bridge that just unwraps the payload from one frame type and shoots it out rewrapped for the other. The AP has to do things like send out broadcast frames with the SSID, send and receive the link setup/teardown kinds of frames (i.e. the link between your PC's wireless interface and the AP), as well as bridging/routing traffic from the wired network to the wireless network. Who's to say what kind of logic they have inside there to deal with all the issues (the wireless MAC and the wired MAC are different, if nothing else.) Jim Lux From kohlja at ornl.gov Wed Feb 6 15:13:28 2008 From: kohlja at ornl.gov (kohlja@ornl.gov) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... Message-ID: <20080206231328.GA1249@neo.csm.ornl.gov> Hey Gang! Sounds like you're having some "fun" with PVM over wireless...? :-) (A buddy (Wael Elwasif) forwarded your discussion to me; please always feel free to copy "pvm@msr.csm.ornl.gov" with PVM inquiries when you get stuck. I try to be pretty responsive, though this is all unfunded work now... :) So, the master's network interface/IP selection was my first guess, too, after reading about your situation, but this email below would seem to eliminate that possibility... 
Just to be sure though, I assume you're starting PVM on the master host with the "-nfoo" host name argument, to choose the desired network interface/IP address, and that the /tmp/pvml. log file on the master reflects/verifies this IP...? :) Are there any error messages in the PVM log files on either the master or the slave machines...? (Btw, which $PVM_ARCH are we talking about here, "LINUX" or "BEOLIN"...? :) There are a few weird scenarios under which PVM will quietly drop or ignore packets coming from the slave daemons, when the IP doesn't appear to match what the master expects... ("to serve you better" and protect against external intrusions, ha ha ha... :) As far as timing out/latency, which was another line of your discussion I read through, I don't _think_ PVM cares about the fine-grained latency that you're talking about, between wireless and wired... The daemons are on a nice long timeout, like 3 _minutes_ before they assume something died... And for startup, the master doesn't strictly "wait" for the slaves to connect, it merely provides them with the proper socket address for where to connect themselves up... (hence the option you've mentioned about manually starting a slave daemon, and having it just connect up to the master) So what about firewalls or blocked ports...? Does the wireless network leave the PVM ports open? (The port number is chosen at random by the system, unless the "$PVMNETSOCKPORT" environment variable is set with a starting port number for the desired range...) Anything in the master's regular system logs (or the slave's PVM log file) about "Connection Refused"...? Just an idear. Please lemme know if this is all still a dead end. (And send along any error messages from the PVM logs...! :-) Good Luck & "Long Live PVM"...! :) Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeem ;) (a.k.a. Jim Kohl, kohlja@ornl.gov :) > From: "Robert G. Brown" > Date: Wed, 6 Feb 2008 13:21:55 -0500 (EST) > Subject: Re: [Beowulf] PVM on wireless... 
> To: Bill Rankin > Cc: Beowulf Mailing List > Message-ID: > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed > > On Wed, 6 Feb 2008, Bill Rankin wrote: > > > Hey Rob, > > > > Could it be a node naming issue where the wireless IP does not resolve > > to > > the same address as that used in the machinefile? I seem to recall a > > similar issue back when we PVM on machines with multiple network > > connections. > > pvmd is actually starting up on the target machine -- it works that far. > The master node IP number is correct, as is the slave IP number (both > visible as arguments to pvmd). The name I'm using is the one associated > with the wireless interface in question, both machines ping in all four > directions by name with the correct internet address. All my machines > are configured more or less identically, use the same environment > variables, support transparent ssh command execution (which obviously > works even in PVM as the daemon is being spawned on the correct target). > > The wireless interfaces have the right MTU and look exactly like the > ethernet devices they in fact are to the kernel AFAIK. In every other > aspect I've ever tested, including my own homemade socket code, response > to both tcp and udp daemons, ability to mount NFS, support ssh, and so > on and so forth, they behave like TCP/IP sockets over ethernet devices > as far as systems calls go -- they use the same interface, and the whole > point of OSI/ISO is that code should not depend on the hardware layer > and in general on even a roughly posix compliant machine using standard > devices and e.g. the socket API it doesn't. > > Last time I encountered this, I actually cranked up the -d0x0 stuff and > "watched" as the system went through to where it hung in the middle of > doing some part of the post-spawn handshaking. > > I suspect a race condition, probably caused by using raw UDP with some > assumption of latency during the handshake. 
The one way I can think of > that the two connections differ is in their latency -- even the > bandwidth of wireless is every bit as great as 10B2 networks I've run > PVM on in years past (on proportionally slower CPUs, of course). If the > master or slave send out an acknowledgement packet either before the > window where the other can receive it or after it has grown bored and > stopped listening, it might fail to properly bind or something. It > seems like it would be a bug, not a feature, but if I were feeling > infinitely masochistic and were to wander down into Other People's > Source (ouch!) to try to debug this, that's what I'd look for first. > > Any PVM developers still on list? Any comments from them? > > rgb (:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(: James Arthur "Jeeembo" Kohl, Ph.D. Oak Ridge National Laboratory kohlja@ornl.gov http://www.csm.ornl.gov/~kohl/ "Da Blooos Brathas?! They still owe you money, Fool!" Long Live Curtis Blues!!! :):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):) From wrankin at ee.duke.edu Wed Feb 6 15:18:27 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: PVM on wireless... In-Reply-To: References: Message-ID: <3622EF04-4040-4EBD-AF07-0BF4D1CAB8AD@ee.duke.edu> I have a home setup similar to yours - a WAP acting as a firewall, dhcp from a linux server. I have a spare laptop running CentOS, so I'll give it a check tonight to see if mine runs. Q1: do you have DHCP giving a static address to your laptop based upon its MAC? Q2: have you tried this with the PVM 3.4.4 RPMs (I think you mentioned you were running 3.4.5)? -b > > I can literally snap the same box onto a wire, wait for it to get > an IP > number on the wire, and rerun the experiment on the same hardware > and it > works perfectly (with a different but identically entered name, of > course).
And it is the wireless name that corresponds with the > `hostname` (in /etc/sysconfig/network), not that this should matter > (and > it doesn't on the wire). > From reuti at staff.uni-marburg.de Wed Feb 6 16:39:58 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: PVM on wireless... In-Reply-To: <3622EF04-4040-4EBD-AF07-0BF4D1CAB8AD@ee.duke.edu> References: <3622EF04-4040-4EBD-AF07-0BF4D1CAB8AD@ee.duke.edu> Message-ID: <42BC3720-4E9C-43D9-9F9D-2D82F0590D9D@staff.uni-marburg.de> Am 07.02.2008 um 00:18 schrieb Bill Rankin: > I have a home setup similar to yours - a WAP acting as a firewall, > dhcp from a linux server. I have a spare laptop running CentOS, so > I'll give it a check tonight to see if mine runs. > > Q1: to you have DHCP giving a static address to your laptop based > upon it's MAC? > > Q2: have you tried this with the PVM 3.4.4 RPMs (I think you > mentioned you were running 3.4.5)? There are even newer patches: http://www.csm.ornl.gov/~kohl/PVM/pvm3.4.5+9.tar.Z -- Reuti > -b > >> >> I can literally snap the same box onto a wire, wait for it to get >> an IP >> number on the wire, and rerun the experiment on the same hardware >> and it >> works perfectly (with a different but identically entered name, of >> course). And it is the wireless name that corresponds with the >> `hostname` (in /etc/sysconfig/network), not that this should >> matter (and >> it doesn't on the wire). >> > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Feb 7 08:08:46 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... 
In-Reply-To: <6.2.3.4.2.20080206140052.032df040@mail.jpl.nasa.gov> References: <02A63D14-3E34-4C0E-A012-D491922AC023@ee.duke.edu> <220FE1C2-C27A-4B94-8060-D4D78DFCF50A@staff.uni-marburg.de> <6.2.3.4.2.20080206140052.032df040@mail.jpl.nasa.gov> Message-ID: On Wed, 6 Feb 2008, Jim Lux wrote: > Ahh.. but there is a "routing" process of sorts inside your AP. It has to > bridge from the 802.3 wired world to the 802.11 wireless world, and that > usually involves some store and forward type processing. Some of these are > implemented as a store and forward router (e.g. home firewall) with one of > the logical ports connected to the wireless modem. Very, very few access > points (if any) are actually a dumb packet oriented bridge that just unwraps > the payload from one frame type and shoots it out rewrapped for the other. > The AP has to do things like send out broadcast frames with the SSID, send > and receive the link setup/teardown kinds of frames (i.e. the link between > your PC's wireless interface and the AP), as well as bridging/routing traffic > from the wired network to the wireless network. > > Who's to say what kind of logic they have inside there to deal with all the > issues (the wireless MAC and the wired MAC are different, if nothing else.) No arguments, but... As far as the programmer API is concerned, IP is IP is IP, TCP is even more removed. The whole point of TCP, in fact, is that one is NOT supposed to need to know or care if the packet one is wrapping up for some destination is about to go out on a wire or wireless link, travel over copper or fiber, pass through hubs, bridges, routers, switches. A properly formed packet that isn't in a channel controlled by e.g. firewalls or port blockers is "guaranteed" to reach its destination, if its destination be reachable and correctly bidirectionally routed, and even to be resequenced and/or retransmitted if need be until the entire message is at least "reasonably" accurately received by the receiver. 
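The resequencing that TCP does on the application's behalf, as described above, can be sketched in a few lines. This is only a toy illustration in Python, not how any real TCP stack or pvmd is written; the function name reassemble and the (sequence number, payload) tuple format are assumptions made up for the example.

```python
def reassemble(segments):
    """Rebuild a byte stream from (seq, payload) pairs that may arrive
    out of order or duplicated. Returns the stream and the cumulative
    acknowledgement the receiver would send after each arrival."""
    buffered = {}      # segments received ahead of the one we are waiting for
    next_seq = 0       # sequence number of the next segment we can deliver
    stream = bytearray()
    acks = []
    for seq, payload in segments:
        if seq >= next_seq:
            # Keep new data; anything below next_seq is a stale duplicate.
            buffered.setdefault(seq, payload)
        # Deliver every segment that is now contiguous with the stream.
        while next_seq in buffered:
            stream += buffered.pop(next_seq)
            next_seq += 1
        # Cumulative ack: "I have everything before next_seq".
        acks.append(next_seq)
    return bytes(stream), acks
```

For example, reassemble([(1, b"b"), (0, b"a"), (1, b"b"), (2, b"c")]) returns (b"abc", [0, 2, 2, 3]): the early segment is held back, the duplicate is ignored, and a sender seeing the same ack repeated would know when to retransmit.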
UDP is somewhat different. It is a connectionless protocol, for one thing. However, the most important difference is that it is not a "reliable" protocol -- it is close to what one might call "raw" IP. Form a packet, drop it on the wire, pray that it is received. If it is part of a sequence of packets, pray that they are received in the correct order, as WAN connections may well switch routes or delay individual packets en route as the circumstances of traffic dictate or lose a packet altogether. Services built on UDP (PVM and at one time NFS) have to basically replicate TCP's e.g. packet sequencing and reliable delivery checks in order to become reliable. Ordinarily UDP based services are non-critical, and they are usually offered only "on the same wire" -- on a network without a lot of routing hops in between, although switched connections or single-hop bridges don't usually constitute a problem -- unless UDP is so augmented to make it reliable, and even then it is RARE to run a UDP-based service over a WAN AFAIK. I still don't seriously suspect the WAP per se, because every other service in the Universe, TCP or UDP or ICMP, that I've used over wireless works perfectly, always. Oh, the connection itself isn't horribly reliable -- turn on the microwave oven, drop the link, load the device heavily, links get a bit flaky -- but EVERYthing works when the link is up and solid. To the best of my ability to test it (which isn't terribly shabby, given nmap after all), it is transparent to IP from broadcasts on down to individual ports on the local bridged 192.168.1.x network, in both directions. What is different on a WAP is timing (e.g. latency). As you say, there's a fair bit of out-of-band traffic associated with wireless links. My MIMO router up to the very latest firmware upgrade would generate all sorts of spurious traffic that I suspect was associated with link optimization and so on, but of course it was difficult to know for sure as it was largely out of band.
Even so, however, it is really, really odd that PVM has a segment that is so sensitive (or so unusual in terms of its socket code) that it fails while everything else works. Anyway, it sounds like the general answer is that nobody on list has really encountered this or knows what is causing it, so I guess my choices are to grab the PVM sources, do a build, do a -d0x0 run to isolate once again the precise point where the process of adding a wireless host fails, instrument the code to the point (possibly on both ends -- it could be the target PVMD as easily as the master) where I can actually see what is or isn't getting through, and then either figure out why and "properly" fix it or muck around with the code to where the problem goes away even though I don't know why (by e.g. inserting "arbitrary" delays here or there to give a wireless network time to catch up and avoid a race, which sucks I agree but which often works anyway...;-) OR to just blow it off again and live with it, like I did last time. Or I suppose I could always file a bugzilla report and hope that it filters back to the developers who actually know the code and can properly fix it. Hmmm, time time time. Who has the time. rgb > > Jim Lux > > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From wrankin at ee.duke.edu Thu Feb 7 09:15:34 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: PVM on wireless... In-Reply-To: <3622EF04-4040-4EBD-AF07-0BF4D1CAB8AD@ee.duke.edu> References: <3622EF04-4040-4EBD-AF07-0BF4D1CAB8AD@ee.duke.edu> Message-ID: <6DAAD27B-C644-4C8F-AFD9-D7A2DEE65C35@ee.duke.edu> Update. 
I got PVM running on my laptop and successfully added one of my servers to the hostlist using the command line at the pvm prompt. This was over wireless. The laptop was running pvm 3.4.5-7 rpm under CentOS 5. The other machine had pvm 3.4.4 under CentOS 4. The main bits seemed to be: Getting PVM_ROOT=/usr/share/pvm3 and PVM_RSH=ssh set on both sides (added to .bashrc). Checked by doing an 'ssh export' and verified contents. Rob: do you do host-based authentication under ssh? I don't, so I had to type in my passwords at the 'pvm>' prompt. Sorry I can't offer anything more. -bill On Feb 6, 2008, at 6:18 PM, Bill Rankin wrote: > I have a home setup similar to yours - a WAP acting as a firewall, > dhcp from a linux server. I have a spare laptop running CentOS, so > I'll give it a check tonight to see if mine runs. > > Q1: to you have DHCP giving a static address to your laptop based > upon it's MAC? > > Q2: have you tried this with the PVM 3.4.4 RPMs (I think you > mentioned you were running 3.4.5)? > > -b > >> >> I can literally snap the same box onto a wire, wait for it to get >> an IP >> number on the wire, and rerun the experiment on the same hardware >> and it >> works perfectly (with a different but identically entered name, of >> course). And it is the wireless name that corresponds with the >> `hostname` (in /etc/sysconfig/network), not that this should >> matter (and >> it doesn't on the wire). >> > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Feb 7 09:55:31 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: <20080206231328.GA1249@neo.csm.ornl.gov> References: <20080206231328.GA1249@neo.csm.ornl.gov> Message-ID: On Wed, 6 Feb 2008, kohlja@ornl.gov wrote: > Hey Gang! 
> > Sounds like you're having some "fun" with PVM over wireless...? :-) > > (A buddy (Wael Elwasif) forwarded your discussion to me; > please always feel free to copy "pvm@msr.csm.ornl.gov" > with PVM inquiries when you get stuck. I try to be > pretty responsive, though this is all unfunded work now... :) Bless you. However, I've just managed to figure the problem out on my own. It is, after all, a firewall issue. There are apparently different/new defaults in Fedora 7 and 8 than I expected. If I >>completely disable<< the firewall it works. This isn't really desirable, so I'll go back and see if I can figure out how to open the minimal set of ports to make it work. I wasn't seeing it in my earlier tests because I was verifying that it worked FROM a newly installed wired Fedora 8 host to my older hosts, that happened to be wired, or to a fedora 7 or fedora 8 laptop that wouldn't work even with the appropriate interfaces set to trusted. When just to be thorough I tried to configure the F8 wired system from an older F6 wired system, it failed too, which led me to try disabling the firewall altogether. I apologize to all those who wasted time trying to help me with something I should have figured out on my own. I was fooled by the accidental appearance of order, with both my extant laptops running the same dysfunctional firewall, and by testing connections only FROM my one wired host running F8. I should have just kept plugging until I tested to and from every pair. While I've got the One True PVM Human(s) on the line, though -- a suggestion for PVM to help others avoid this problem in the future on networks wired and wireless: It would really, really help if man pvm (or man pvmd or man pvm_intro) documented a suitable firewall setting that will let PVM function without just turning off the firewall altogether.
There is no pvm setup in /etc/services, for example, no pvm checkbox in the panels managed by system-config-firewall in the latest Fedoras, no suggestion as to what protected port(s) or ranges one has to enable explicitly. In fact for once even google is failing me -- I'm not finding a lot of documentation or remarks by ANYONE on what ports pvm needs open (besides ssh, which obviously is open and works). Usually as long as the spawning of a network application itself works using an enabled protected port (in this case, I would have expected ssh), the secondary ports opened in unprotected space just work. Am I wrong in this? Do I need to explicitly open more ports somewhere? To find out, this leaves me with running e.g. tcpdump and watching as pvm attempts to connect, opening port ranges one at a time and doing a binary search, or something similarly painful. Or just asking you. So what (minimal set of) ports do I need to leave open besides ssh, which is always open on my systems anyway? An additional suggestion would be to (if possible) have the RPM install "fix" the port situation so that pvm shows up on system-config-firewall and/or finish with a message to the installer that a particular firewall setting must be installed or enabled and/or add something to the debugging info provided by pvm so that on a timeout (in particular) it prints something like "Unable to connect due to timeout. Verify that pvm is correctly installed and that port range xxxx-xxxx is open on the target." I actually help a lot of people get started with PVM (they write me offline because I have a template PVM tarball up on my personal website) and the more I know, the better I can help them...;-) rgb -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C.
27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From wrankin at ee.duke.edu Thu Feb 7 10:34:37 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: <20080206231328.GA1249@neo.csm.ornl.gov> Message-ID: <99265B4F-305C-4E99-AA7F-93CAABE571B8@ee.duke.edu> I think that I managed to replicate your problem, Rob. Laptop running CentOS 5, pvm 3.4.5-7(rpm), wireless ethernet. Server running FC6, pvm 3.4.5-7(rpm) Ssh working fine in both directions, PVM_ROOT and PVM_RSH set accordingly. Running "pvm" from the shell on the server and doing an "add " at the prompt. Prompted for password. PVM then hangs waiting to add remote host. On the remote host, we see the pvmd running with a "ps". If I do nothing: the remote pvmd eventually dies and after that the command prompt on the server returns with a "1 successful" message, but a "conf" command shows that no hosts were added. Here is the weird part: if after I issue the "add " command, I then go over to the laptop and run "pvm" from a shell, the connection is made and the hosts are successfully added. So you may want to try this and see if you get similar behavior. Last datapoint: if from my laptop I attempt to add a host that has PVM 3.4.4 (CentOS4 rpm) installed, it starts up fine. So I think that it's a bug in 3.4.5-7. I haven't tried it over a wired connection yet. So you may want to try dropping back to version 3.4.4 on all machines and see if that helps. Jim Kohl at ORNL seems to have several patches to 3.4.5, and I'm wondering if this issue has already been addressed. -bill From rgb at phy.duke.edu Thu Feb 7 10:41:57 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless...
In-Reply-To: <99265B4F-305C-4E99-AA7F-93CAABE571B8@ee.duke.edu> References: <20080206231328.GA1249@neo.csm.ornl.gov> <99265B4F-305C-4E99-AA7F-93CAABE571B8@ee.duke.edu> Message-ID: On Thu, 7 Feb 2008, Bill Rankin wrote: > I think that I managed to replicate your problem, Rob. > > Laptop running CentoOS5, pvm 3.4.5-7(rpm), wireless ethernet. > Server running FC6, pvm 3.4.5-7(rpm) > > Ssh working fine in both directions, PVM_ROOT and PVM_RSH set accordingly. Try it with the firewalls completely down and I'll bet it works. However, it is REALLY strange that it works with them UP for some combinations. Or not so strange -- that's what was fooling me, after all. Perhaps the port ranges being used are varying with version or chance. rgb > > Running "pvm" from the shell on the server and doing an "add " at the > prompt. > Prompted for password. > PVM then hangs waiting to add remote host. > On the remote host, we see the pvmd running with a "ps". > > If I do nothing: the remote pvmd eventually dies and after that the command > prompt on the server returns with a "1 successful" message, but a "conf" > command shows that no hosts were added. > > Here is the weird part: if after I issue the "add " command, I then > go over to the laptop and run "pvm" from a shell, the connection is made and > the hosts are successfully added. > > So you may want to try this and see if you get similar behavior. > > > Last datapoint: if from my laptop I attempt to add a host that has PVM 3.4.4 > (CentOS4 rpm) installed, it starts up fine. So I think that it's a bug in > 3.4.5-7. I haven't tried it over a wired connection yet. > > So you may want to try dropping back to version 3.4.4 on all machines and see > if that helps. > > > Jim Kohl at ORNL seems to have several patches to 3.4.5, and I'm wondering if > this issue has already been addressed. > > > -bill -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 
27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From kohlja at ornl.gov Thu Feb 7 10:53:04 2008 From: kohlja at ornl.gov (kohlja@ornl.gov) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: <20080206231328.GA1249@neo.csm.ornl.gov> Message-ID: <20080207185304.GA11286@neo.csm.ornl.gov> Hi Robert/Rob/RGB! :-) On Thu, Feb 07, 2008 at 12:55:31PM -0500, Robert G. Brown wrote: > On Wed, 6 Feb 2008, kohlja@ornl.gov wrote: >> Hey Gang! >> Sounds like you're having some "fun" with PVM over wireless...? :-) >> (A buddy (Wael Elwasif) forwarded your discussion to me; >> please always feel free to copy "pvm@msr.csm.ornl.gov" >> with PVM inquiries when you get stuck. I try to be >> pretty responsive, though this is all unfunded work now... :) > Bless you. De nada, you're welcome. :-) > However, I've just manage to figure the problem out on my own. It is, > after all, a firewall issue... Ah, Good! Glad that's all it was, not that it wasn't a hassle to identify! :) Sorry it was so non-obvious from the PVM side of things... :-b > While I've got the One True PVM Human(s) on the line, though... Mwuahahahahahaaaa... :-) > -- a suggestion for PVM to help others avoid this problem in the future > on networks wired and wireless: > It would really, really help if man pvm (or man pvmd or man pvm_intro) > documented a suitable firewall setting that will let PVM function > without just turning off the firewall altogether. There is no pvm setup > in /etc/services, for example, no pvm checkbox in the panels managed by > system-config-firewall in the latest Fedoras, no suggestion as to what > what protected port(s) or ranges one has to enable explicitly. 
In fact > for once even google is failing me -- I'm not finding a lot of > documentation or remarks by ANYONE on what ports pvm needs open (besides > ssh, which obviously is open and works). Usually as long as the > spawning of a network application itself works using an enabled > protected port (in this case, I would have expected ssh), the secondary > ports opened in unprotected space just work. Am I wrong in this? Do I > need to explicitly open more ports somewhere? Ah Yes. O.K., so I wish it was that simple, but alas PVM can use as many ports as you have machines in your cluster, or could use just 1. :-} Normally, the master pvmd creates/accepts connections over a small set of ports, possibly 1, but if PvmRouteDirect is enabled in a PVM application, then a myriad of direct-connection socket links are created, to link whichever machines the local PVM application tasks communicate with, on a demand-driven basis... So it's not generally possible to specify an explicit "range" of ports. However, it _is_ possible to set the "starting" port for this collection, using the aforementioned "$PVMNETSOCKPORT" environment variable. This sets the first port that PVM will try to use, and all subsequent ports will usually be consecutive positive increments of that starting port (i.e. PVMNETSOCKPORT++... :-). So in most cases, you could probably plan on opening up a 100 or 1000 ports _somewhere_ in your firewall, depending on your needs, and then just tell PVM where to start, using $PVMNETSOCKPORT... I've always considered this solution a bit of a kludge, which is why it doesn't show up in the man pages, but if it works well enough to save people lots of hassle, then I can add some commentary on it...? > To find out, this leaves me with running e.g. tcpdump and watching as > pvm attempts to connect, opening port ranges one at a time and doing a > binary search, or something similarly painful. Or just asking you. 
So > what (minimal set of) ports do I need to leave open besides ssh, which > is always open on my systems anyway? > An additional suggestion would be to (if possible) have the RPM install > "fix" the port situation so that pvm shows up on system-config-firewall > and/or finish with a message to the installer that a particular firewall > setting must be installed or enabled and/or add something to the > debugging info provided by pvm so that on a timeout (in particular) it > prints something like "Unable to connect due to timeout. Verify that > pvm is correctly installed and that port range xxxx-xxxx is open on the > target." You _should_ be getting some sort of timeout message in the slave pvmd's log file (/tmp/pvml. on the slave machine), when the connection request to the master pvmd doesn't get a reply...? It may depend on the firewall settings, but a nice "Connection Refused" would usually go a long way toward diagnosing things, whereas the more secure firewall alternative of simply "no response" would only result in a "timed out" PVM message... I'm open to suggestions on ways to identify or diagnose the problem...! Thanks Much for your interest and feedback! All the Best, Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeem ;) > I actually help a lot of people get started with PVM (they write me > offline because I have a template PVM tarball up on my personal website) > and the more I know, the better I can help them...;-) > rgb > -- > Robert G. Brown Phone(cell): 1-919-280-8443 > Duke University Physics Dept, Box 90305 > Durham, N.C. 27708-0305 > Web: http://www.phy.duke.edu/~rgb > Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php > Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 (:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(: James Arthur "Jeeembo" Kohl, Ph.D. "Da Blooos Brathas?! They Oak Ridge National Laboratory still owe you money, Fool!" 
kohlja@ornl.gov http://www.csm.ornl.gov/~kohl/ Long Live Curtis Blues!!! :):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):) From wrankin at ee.duke.edu Thu Feb 7 11:23:13 2008 From: wrankin at ee.duke.edu (Bill Rankin) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: <20080206231328.GA1249@neo.csm.ornl.gov> <99265B4F-305C-4E99-AA7F-93CAABE571B8@ee.duke.edu> Message-ID: <5D962A53-7593-4370-98FC-F0D74702208A@ee.duke.edu> On Feb 7, 2008, at 1:41 PM, Robert G. Brown wrote: > On Thu, 7 Feb 2008, Bill Rankin wrote: > >> I think that I managed to replicate your problem, Rob. >> >> Laptop running CentoOS5, pvm 3.4.5-7(rpm), wireless ethernet. >> Server running FC6, pvm 3.4.5-7(rpm) >> >> Ssh working fine in both directions, PVM_ROOT and PVM_RSH set >> accordingly. > > Try it with the firewalls completely down and I'll bet it works. Well, duh. Yeah, that was it. Although disabling the firewall on a wireless connection does not give me the warm fuzzies. > However, it is REALLY strange that it works with them UP for some > combinations. Or not so strange -- that's what was fooling me, after > all. Perhaps the port ranges being used are varying with version or > chance. I suspect that's the issue here. -b From gerry.creager at tamu.edu Thu Feb 7 11:26:47 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: References: <20080206231328.GA1249@neo.csm.ornl.gov> <99265B4F-305C-4E99-AA7F-93CAABE571B8@ee.duke.edu> Message-ID: <47AB5B77.9050802@tamu.edu> FWIW, we saw this with ROCKS and MPICH, a couple of years ago. Took a lot of firewall tweaking, and it's been too many beers to recall the details, to get things working. It is odd. gerry Robert G. Brown wrote: > On Thu, 7 Feb 2008, Bill Rankin wrote: > >> I think that I managed to replicate your problem, Rob. >> >> Laptop running CentoOS5, pvm 3.4.5-7(rpm), wireless ethernet. 
>> Server running FC6, pvm 3.4.5-7(rpm) >> >> Ssh working fine in both directions, PVM_ROOT and PVM_RSH set >> accordingly. > > Try it with the firewalls completely down and I'll bet it works. > > However, it is REALLY strange that it works with them UP for some > combinations. Or not so strange -- that's what was fooling me, after > all. Perhaps the port ranges being used are varying with version or > chance. > > rgb > >> >> Running "pvm" from the shell on the server and doing an "add " >> at the prompt. >> Prompted for password. >> PVM then hangs waiting to add remote host. >> On the remote host, we see the pvmd running with a "ps". >> >> If I do nothing: the remote pvmd eventually dies and after that the >> command prompt on the server returns with a "1 successful" message, >> but a "conf" command shows that no hosts were added. >> >> Here is the weird part: if after I issue the "add " command, I >> then go over to the laptop and run "pvm" from a shell, the connection >> is made and the hosts are successfully added. >> >> So you may want to try this and see if you get similar behavior. >> >> >> Last datapoint: if from my laptop I attempt to add a host that has PVM >> 3.4.4 (CentOS4 rpm) installed, it starts up fine. So I think that >> it's a bug in 3.4.5-7. I haven't tried it over a wired connection yet. >> >> So you may want to try dropping back to version 3.4.4 on all machines >> and see if that helps. >> >> >> Jim Kohl at ORNL seems to have several patches to 3.4.5, and I'm >> wondering if this issue has already been addressed. >> >> >> -bill > -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From mathog at caltech.edu Thu Feb 7 12:33:08 2008 From: mathog at caltech.edu (David Mathog) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] Re: PVM on wireless... Message-ID: > From: "Robert G. 
Brown" > However, I've just managed to figure the problem out on my own. It is, > after all, a firewall issue. Good that you sorted that out. A word of warning though, just yesterday I ran into a case where the command to "turn the firewall off", didn't. What it did instead was wall off the machine. This was on a vanilla Mandriva 2007.1 machine, after: /etc/rc.d/init.d/shorewall stop iptables showed that the Input and Forward chains were set to DROP. Of course the only way I could find this out was on the console of that machine, which was luckily only about 5 feet away. This may be what is desired in some instances, but it wasn't what I wanted here. (Plus it would suck big time if that happened on a remotely administered machine.) To really get rid of the firewall /etc/rc.d/init.d/iptables stop was also needed. After that iptables --list showed the expected ACCEPT on all 3 chains and the packets that needed to get through for the test finally did. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Thu Feb 7 13:42:21 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu Aug 28 01:06:49 2008 Subject: [Beowulf] PVM on wireless... In-Reply-To: <20080207185304.GA11286@neo.csm.ornl.gov> References: <20080206231328.GA1249@neo.csm.ornl.gov> <20080207185304.GA11286@neo.csm.ornl.gov> Message-ID: On Thu, 7 Feb 2008, kohlja@ornl.gov wrote: > > It would really, really help if man pvm (or man pvmd or man pvm_intro) > > documented a suitable firewall setting that will let PVM function > > without just turning off the firewall altogether. There is no pvm setup > > in /etc/services, for example, no pvm checkbox in the panels managed by > > system-config-firewall in the latest Fedoras, no suggestion as to what > > protected port(s) or ranges one has to enable explicitly.
> > In fact, for once even google is failing me -- I'm not finding a lot of
> > documentation or remarks by ANYONE on what ports pvm needs open (besides
> > ssh, which obviously is open and works). Usually as long as the
> > spawning of a network application itself works using an enabled
> > protected port (in this case, I would have expected ssh), the secondary
> > ports opened in unprotected space just work. Am I wrong in this? Do I
> > need to explicitly open more ports somewhere?
>
> Ah Yes. O.K., so I wish it was that simple, but alas PVM can use as
> many ports as you have machines in your cluster, or could use just 1. :-}
>
> Normally, the master pvmd creates/accepts connections over a small
> set of ports, possibly 1, but if PvmRouteDirect is enabled in a PVM
> application, then a myriad of direct-connection socket links are
> created, to link whichever machines the local PVM application tasks
> communicate with, on a demand-driven basis...
>
> So it's not generally possible to specify an explicit "range" of ports.
> However, it _is_ possible to set the "starting" port for this collection,
> using the aforementioned "$PVMNETSOCKPORT" environment variable.

OK, I'm giving this a try. Although I'd have to ask why pvmd doesn't do
the fork thing and clone a single open port on which it listens into a
dynamically allocated port that inherits from the open one. In
principle one only needs a single port to be open to connect to pretty
much any network based application, or so I had thought. At least, I do
that in xmlsysd and never have to punch more than one porthole through a
firewall.

Hmmm, it's working sort of -- looks like I need to open UDP ports,
right, not TCP? Having trouble on one host where I've punched the hole
but didn't >>locally<< set PVMNETSOCKPORT to match, so I'm trying again
with the local environment variable set.

Yup, that works.

So I'm guessing that pvmd reads it as it starts up wherever. Why does
it need to do this on a client?
Can't the port(s) be passed from the master when it starts up pvmd?

> This sets the first port that PVM will try to use, and all subsequent
> ports will usually be consecutive positive increments of that starting
> port (i.e. PVMNETSOCKPORT++... :-).
>
> So in most cases, you could probably plan on opening up 100 or 1000
> ports _somewhere_ in your firewall, depending on your needs, and then
> just tell PVM where to start, using $PVMNETSOCKPORT...
>
> I've always considered this solution a bit of a kludge, which is why
> it doesn't show up in the man pages, but if it works well enough to
> save people lots of hassle, then I can add some commentary on it...?

Kludge or not, how can you have an environment variable in an
application and not provide knowledge of it or instructions on its use
in the man page? Something like:

  PVM requires open ports on target hosts to function. Many hosts are
  installed with strong firewall rules by default. If you install pvm on
  a slave and pvm appears to hang when you attempt to add it, eventually
  timing out without success, consider adding the following to your local
  personal or system environment (in, for example, ~/.bash_profile on all
  hosts):

    PVMNETSOCKPORT=10000
    export PVMNETSOCKPORT

  Then configure your firewall(s) to open a range of udp ports starting
  at this value, such as 10000-11024 (which need not be any larger than
  the largest number of machines you expect to have in your virtual
  machine).

However, a better solution still is to have the daemon fork on a single
"permanent" port address > 1024, e.g. 10000, and get a negotiated
connection in the upper (non-protected) port space that way.

> It may depend on the firewall settings, but a nice "Connection
> Refused" would usually go a long way toward diagnosing things,
> whereas the more secure firewall alternative of simply
> "no response" would only result in a "timed out" PVM message...
>
> I'm open to suggestions on ways to identify or diagnose the problem...!
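
The recipe in the suggested man-page text above can be sketched as a small dry-run helper. It only prints the environment setting and the firewall rule, so nothing changes until the output has been reviewed and run by hand as root. The function name pvm_port_rules is hypothetical, the 10000-11024 window is just the example range from the text, and the use of plain iptables (rather than shorewall or system-config-firewall) is an assumption about the host.

```shell
#!/bin/sh
# Hypothetical helper (not part of PVM): print, but do not run, the settings
# for a PVM port window. Arguments: first port, last port.
pvm_port_rules() {
    start=${1:-10000}
    end=${2:-11024}
    # In the thread the variable had to be set locally on each host
    # before pvmd would use the opened ports.
    echo "export PVMNETSOCKPORT=$start"
    # -I inserts the rule ahead of any existing DROP/REJECT rule in INPUT.
    # Only UDP is opened, as reported in the thread; whether TCP is also
    # needed for PvmRouteDirect traffic is left open here.
    echo "iptables -I INPUT -p udp --dport $start:$end -j ACCEPT"
}

pvm_port_rules 10000 11024
```

Since the add only succeeded once the variable was set on the slave as well, the same export line probably belongs in the environment of every host in the virtual machine, not just the master.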
As I said, document EVERYTHING in the man page(s). That is what it is
for. Lots of users do, in fact, RTFM but get frustrated and give up when
they try something and it just doesn't work and they can't see why.

On the same line, a perennial problem with PVM is getting it to work
with rsh and ssh. In fact, half the problems I help people with, when
they randomly write me, come down to getting it to work with one or the
other. The internal diagnostics are certainly very helpful at this
point, but it would also be worth adding a new man page like pvm_rsh
that does nothing but walk users through the ritual of setting PVM_RSH
and establishing the appropriate keys (e.g. for ssh).

Just a thought or two.

   rgb

>
> Thanks Much for your interest and feedback!
>
> All the Best,
>
> Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeem ;)
>
> > I actually help a lot of people get started with PVM (they write me
> > offline because I have a template PVM tarball up on my personal website)
> > and the more I know, the better I can help them...;-)
>
> > rgb
>
> > --
> > Robert G. Brown Phone(cell): 1-919-280-8443
> > Duke University Physics Dept, Box 90305
> > Durham, N.C. 27708-0305
> > Web: http://www.phy.duke.edu/~rgb
> > Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
> > Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
>
> (:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:
>
> James Arthur "Jeeembo" Kohl, Ph.D.     "Da Blooos Brathas?! They
> Oak Ridge National Laboratory          still owe you money, Fool!"
> kohlja@ornl.gov
> http://www.csm.ornl.gov/~kohl/         Long Live Curtis Blues!!!
>
> :):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):)
>

--
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C.
27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

From kohlja at ornl.gov Thu Feb 7 14:11:32 2008
From: kohlja at ornl.gov (kohlja@ornl.gov)
Date: Thu Aug 28 01:06:49 2008
Subject: [Beowulf] PVM on wireless...
In-Reply-To:
References: <20080206231328.GA1249@neo.csm.ornl.gov> <20080207185304.GA11286@neo.csm.ornl.gov>
Message-ID: <20080207221132.GA26027@neo.csm.ornl.gov>

Hey RGB!

Glad the env var worked for you, and sorry PVM is such a port hog. :-]
It was all written long before firewalls were in such common usage
(heck, it was built around .rhosts for authentication! :).

Btw, if I'm not mistaken, I think the master pvmd connects _back_ to
the slave pvmd, too, so both sides need proper holes in their
firewalls, and corresponding PVMNETSOCKPORT settings...?

I understand your basic premise on documenting "all" features in man
pages; my resistance for certain features is based on past experiences
from users "poking around" and shooting themselves in the foot by
trying every little tweak mentioned in the man page, whether they
needed it or not...! :-}

I guess way back when we learned to hide some features to avoid
confusion with novice users, in the hope that more advanced users would
be smart enough to stumble onto them (or ask us howto :).

I admit this may be an antiquated cynical mentality, and I further
concur that PVMNETSOCKPORT is an obvious omission in the basic
documentation/faq...

Thanks for your suggested text! (And the suggestion to enhance our
coverage of rsh/ssh usage... :-)

All the Best,

Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeem ;)

From rgb at phy.duke.edu Fri Feb 8 02:35:31 2008
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu Aug 28 01:06:49 2008
Subject: [Beowulf] PVM on wireless...
In-Reply-To: <20080207221132.GA26027@neo.csm.ornl.gov>
References: <20080206231328.GA1249@neo.csm.ornl.gov> <20080207185304.GA11286@neo.csm.ornl.gov> <20080207221132.GA26027@neo.csm.ornl.gov>
Message-ID:

On Thu, 7 Feb 2008, kohlja@ornl.gov wrote:

> I admit this may be an antiquated cynical mentality, and I
> further concur that PVMNETSOCKPORT is an obvious omission
> in the basic documentation/faq...

As they say, you can't RTFM if there ain't no FM... (or if the solution
exists but isn't there).

One is reminded of Dr. Strangelove, where the president (Peter Sellers)
has just learned that if the maverick B52 piloted by Slim Pickens gets
through, a doomsday device that is supposed to deter first nuclear
strikes will go off that will destroy the world. Unfortunately, the
Soviet Union didn't actually tell us that it was built. Dr. Strangelove
(Peter Sellers), after musing for a moment on the brilliance of the
concept, turns and says in an increasingly shrill voice:

  But...the whole point of the Doomsday Machine...is lost...if you keep
  it a SECRET. Why didn't you tell the world, eh?

Hmmm...;-)

   rgb

> Thanks for your suggested text! (And the suggestion to
> enhance our coverage of rsh/ssh usage... :-)

Ya, well.
Just now finished telling the umptieth would-be PVM user how to
go about it in an email message, augmenting further online docs such as
this one:

  http://www.uow.edu.au/~suresh/web/cfamily/pvm.html

which is actually pretty decent, although I generally use the ssh
default dsa instead of rsa since on linux boxes it invariably works.

But better than forcing each user to employ google to snarf out
solutions to each problem they encounter, how much better to write a
really nice "Getting Started with PVM" or perhaps better still, a "PVM
HOWTO" on tldp.org. Publish there, and be sure to include a copy in
plain sight in /usr/share/pvm3/PVM_HOWTO.

Truthfully, good documentation, especially a walkthrough tutorial on
getting started (including sample code or links to sample code) that
takes a would-be user from "yum install pvm\*" to executing a Real
Parallel Program (however trivial) on a two node cluster would really
encourage the use of the library. Adding a bit more (such as a PVM
program development template) would be only icing on the cake, so to
speak.

If I had the time I'd write it myself. I've already got a project_pvm
program template up on the web, but it is sadly underdocumented through
the setup of PVM itself.

   rgb

--
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

From bernard at vanhpc.org Mon Feb 4 12:43:13 2008
From: bernard at vanhpc.org (Bernard Li)
Date: Thu Aug 28 01:06:49 2008
Subject: [Beowulf] TIPC in a Beowulf?
In-Reply-To:
References: <5a1205b30802031035wf416840s115b6a5278c20812@mail.gmail.com>
Message-ID:

Hi Mark:

On 2/3/08, Mark Hahn wrote:
> I _think_ I'm not confusing TIPC with SCTP (which also seems to be rather
> telecom-oriented.)

Talking about SCTP, FYI both MPICH and Open MPI supports it. For more
information, please see:

  http://www.cs.ubc.ca/labs/dsg/mpi-sctp/

Cheers,

Bernard

From gmichal at uow.edu.au Tue Feb 5 04:33:32 2008
From: gmichal at uow.edu.au (Guillaume Michal)
Date: Thu Aug 28 01:06:49 2008
Subject: [Beowulf] Rocks 4.3 and user accounts
Message-ID: <1202214812.6258.25.camel@earth>

Hi all,

( sorry for the duplicate mail, previous one was sent w