[Beowulf] PVM on wireless...

kohlja at ornl.gov
Wed Feb 6 15:13:28 PST 2008


Hey Gang!

Sounds like you're having some "fun" with PVM over wireless...?  :-)

(A buddy (Wael Elwasif) forwarded your discussion to me;
please always feel free to copy "pvm at msr.csm.ornl.gov"
with PVM inquiries when you get stuck.  I try to be
pretty responsive, though this is all unfunded work now... :)

So, the master's network interface/IP selection was my first
guess, too, after reading about your situation, but this
email below would seem to eliminate that possibility...

Just to be sure though, I assume you're starting PVM
on the master host with the "-nfoo" host name argument,
to choose the desired network interface/IP address,
and that the /tmp/pvml.<uid> log file on the master
reflects/verifies this IP...?  :)
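
(One quick way to double-check the name-to-IP side of this, as a
rough sketch only: ask the resolver what the name you pass to "-n"
actually maps to on that machine. The function name, the host name
"foo" and the address below are made up for illustration; this is
plain Python socket code, not anything PVM itself ships.)

```python
import socket


def resolves_to(hostname, expected_ip):
    """Check whether hostname resolves (IPv4) to expected_ip.

    Returns (ok, addresses), where addresses is whatever the local
    resolver returned; an empty list means the lookup failed.
    """
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        # Covers socket.gaierror / socket.herror: name not resolvable here.
        return False, []
    return expected_ip in addresses, addresses


if __name__ == "__main__":
    # "foo" and 192.168.1.10 are placeholders: use the name given to -n
    # and the wireless interface's address on your own master host.
    ok, found = resolves_to("foo", "192.168.1.10")
    print("match" if ok else "MISMATCH", found)
```

Running the same check on both the master and the slave shows
whether the two machines agree on which address the name means.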

Are there any error messages in the PVM log files
on either the master or the slave machines...?

(Btw, which $PVM_ARCH are we talking about here,
"LINUX" or "BEOLIN"...? :)

There are a few weird scenarios under which PVM will
quietly drop or ignore packets coming from the slave
daemons, when the IP doesn't appear to match what
the master expects... ("to serve you better" and
protect against external intrusions, ha ha ha... :)

As far as timing out/latency, which was another line
of your discussion I read through, I don't _think_ PVM
cares about the fine-grained latency that you're talking
about, between wireless and wired...

The daemons are on a nice long timeout, like 3 _minutes_
before they assume something died...

And for startup, the master doesn't strictly "wait" for
the slaves to connect, it merely provides them with the
proper socket address for where to connect themselves up...
(hence the option you've mentioned about manually starting
a slave daemon, and having it just connect up to the master)

So what about firewalls or blocked ports...?

Does the wireless network leave the PVM ports open?
(The port number is chosen at random by the system,
unless the "$PVMNETSOCKPORT" environment variable
is set with a starting port number for the desired
range...)
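
(If you want to test whether a given UDP port actually gets through
the wireless network, here is a minimal sketch, assuming the
daemon-to-daemon traffic is UDP as rgb's note below suggests. Run
the echo side on one host, bound to a port inside the range you
gave "$PVMNETSOCKPORT", and the probe from the other host. The
function names and payload are invented for this example; none of
this is part of PVM.)

```python
import socket
import threading


def udp_echo_once(sock):
    """Wait for one datagram on an already-bound socket and send it back."""
    data, peer = sock.recvfrom(4096)
    sock.sendto(data, peer)


def udp_probe(host, port, payload=b"pvm-probe", timeout=2.0):
    """Send one datagram to host:port; True if the same bytes come back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            reply, _peer = s.recvfrom(4096)
        except OSError:
            # Timeout, ICMP "port unreachable", no route, and so on.
            return False
        return reply == payload


if __name__ == "__main__":
    # Loopback self-test; between real hosts, bind the echo side to the
    # interface address and a port of your choosing instead of port 0.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_once, args=(server,), daemon=True)
    worker.start()
    print("reachable:", udp_probe("127.0.0.1", port))
    worker.join(timeout=3.0)
    server.close()
```

A False result in one direction but not the other would point at a
firewall or access-point filter rather than at PVM. A single lost
datagram also reads as False, so repeat the probe a few times on a
flaky wireless link before drawing conclusions.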

Anything in the master's regular system logs
(or the slave's PVM log file) about "Connection
Refused"...?
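
(To save some eyeballing, a small sketch that pulls likely network
complaints out of the /tmp/pvml.<uid> log. The search pattern is my
own guess at useful keywords, not a list of PVM's actual message
formats, so widen it as needed; the function names are made up.)

```python
import os
import re

# Assumed keywords only; PVM's real log wording may differ.
PATTERN = re.compile(
    r"connection refused|timed? ?out|unreachable|error", re.IGNORECASE
)


def suspicious_lines(lines):
    """Return (line_number, text) for each line matching PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def scan_pvm_log(path=None):
    """Scan a PVM daemon log; defaults to /tmp/pvml.<uid> for this user."""
    if path is None:
        path = "/tmp/pvml.%d" % os.getuid()
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)


if __name__ == "__main__":
    for number, text in scan_pvm_log():
        print("%6d: %s" % (number, text))
```

Run it on both the master and the slave, since each daemon writes
its own log under its own user id.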

Just an idear.  Please lemme know if this is all
still a dead end.

(And send along any error messages from the PVM logs...! :-)

Good Luck & "Long Live PVM"...!  :)

	Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeem ;)
	(a.k.a. Jim Kohl, kohlja at ornl.gov :)

  > From: "Robert G. Brown" <rgb at phy.duke.edu>
  > Date: Wed, 6 Feb 2008 13:21:55 -0500 (EST)
  > Subject: Re: [Beowulf] PVM on wireless...
  > To: Bill Rankin <wrankin at ee.duke.edu>
  > Cc: Beowulf Mailing List <beowulf at beowulf.org>
  > Message-ID: <Pine.LNX.4.64.0802061312001.20835 at cain.rgb.private.net>
  > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
  > 
  > On Wed, 6 Feb 2008, Bill Rankin wrote:
  > 
  > > Hey Rob,
  > >
  > > Could it be a node naming issue where the wireless IP does not resolve
  > > to the same address as that used in the machinefile?  I seem to recall
  > > a similar issue back when we ran PVM on machines with multiple network
  > > connections.
  > 
  > pvmd is actually starting up on the target machine -- it works that far.
  > The master node IP number is correct, as is the slave IP number (both
  > visible as arguments to pvmd).  The name I'm using is the one associated
  > with the wireless interface in question; both machines ping in all four
  > directions by name with the correct internet address.  All my machines
  > are configured more or less identically, use the same environment
  > variables, support transparent ssh command execution (which obviously
  > works even in PVM as the daemon is being spawned on the correct target).
  > 
  > The wireless interfaces have the right MTU and look exactly like the
  > ethernet devices they in fact are to the kernel AFAIK.  In every other
  > aspect I've ever tested, including my own homemade socket code, response
  > to both tcp and udp daemons, ability to mount NFS, support ssh, and so
  > on and so forth, they behave like TCP/IP sockets over ethernet devices
  > as far as system calls go -- they use the same interface, and the whole
  > point of OSI/ISO is that code should not depend on the hardware layer
  > and in general on even a roughly posix compliant machine using standard
  > devices and e.g. the socket API it doesn't.
  > 
  > Last time I encountered this, I actually cranked up the -d0x0 stuff and
  > "watched" as the system went through to where it hung in the middle of
  > doing some part of the post-spawn handshaking.
  > 
  > I suspect a race condition, probably caused by using raw UDP with some
  > assumption of latency during the handshake.  The one way I can think of
  > that the two connections differ is in their latency -- even the
  > bandwidth of wireless is every bit as great as 10B2 networks I've run
  > PVM on in years past (on proportionally slower CPUs, of course).  If the
  > master or slave send out an acknowledgement packet either before the
  > window where the other can receive it or after it has grown bored and
  > stopped listening, it might fail to properly bind or something.  It
  > seems like it would be a bug, not a feature, but if I were feeling
  > infinitely masochistic and were to wander down into Other People's
  > Source (ouch!) to try to debug this, that's what I'd look for first.
  > 
  > Any PVM developers still on list?  Any comments from them?
  > 
  >     rgb

(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:(:

   James Arthur "Jeeembo" Kohl, Ph.D.     "Da Blooos Brathas?!  They
   Oak Ridge National Laboratory              still owe you money, Fool!"
   kohlja at ornl.gov
   http://www.csm.ornl.gov/~kohl/          Long Live Curtis Blues!!!

:):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):):)


