Marist Beowulf Setup

Donald Becker becker at scyld.com
Thu Nov 29 06:01:50 PST 2001


On Thu, 29 Nov 2001, Richard C Ferri wrote:

> Anthony Sofia <anthony at dryhump.net> on 11/28/2001 12:52:46 PM
>
> I have a couple of problems/questions that
> you might be able to help with. (This is all based on scyld)
> 
> The first problem is that the beoserv and bpmaster daemons are binding
> to -1 instead of an address (192.168.1.1).

The Scyld Beowulf system has special host names for cluster components.

.0, .1  ...   Compute (slave) nodes
.-1	      Front-end (master) nodes

Note the leading ".", which makes this a hostname instead of a number.

This hostname syntax is a valid local text hostname for library
routines.  It won't be misinterpreted as a valid Internet DNS hostname,
or an integer which would be interpreted as an IP number.

With this hostname form we can avoid the overhead or serialization of
hostname lookups by algorithmically translating to an IP address.  We
parse the number and add it to the base IP address of the cluster nodes,
usually 192.168.1.100.  (Implementation note: the correct netmask is
required for this to work with more than 154 hosts.)
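As a rough illustration of the translation described above, here is a
minimal Python sketch.  It is not the Scyld library code.  The function
name node_to_ip, the /24 netmask, and the mapping of ".-1" to the master
address 192.168.1.1 (taken from the question) are all assumptions made
for the example.

```python
import ipaddress

# Assumed values for illustration only; a real cluster reads these from
# its own configuration.
CLUSTER_NET = ipaddress.ip_network("192.168.1.0/24")   # assumed netmask
BASE_IP = ipaddress.IPv4Address("192.168.1.100")       # address of node .0
MASTER_IP = ipaddress.IPv4Address("192.168.1.1")       # assumed address of .-1


def node_to_ip(hostname, base=BASE_IP, master=MASTER_IP, net=CLUSTER_NET):
    """Translate a '.N' cluster hostname to an IP address without any lookup."""
    if not hostname.startswith("."):
        raise ValueError("not a cluster hostname: %r" % hostname)
    number = int(hostname[1:])          # ".-1" parses to -1, ".0" to 0
    if number == -1:
        return master                   # hypothetical mapping for the front end
    if number < 0:
        raise ValueError("invalid node number: %d" % number)
    addr = base + number                # add the node number to the base address
    # With too narrow a netmask, high node numbers fall outside the subnet
    # or land on its broadcast address.
    if addr not in net or addr == net.broadcast_address:
        raise ValueError("node %d does not fit in %s" % (number, net))
    return addr


if __name__ == "__main__":
    for name in (".-1", ".0", ".5"):
        print(name, "->", node_to_ip(name))
```

With the assumed /24 network, node .154 maps to 192.168.1.254 and node
.155 is rejected, which is one way to see why a wider netmask is needed
for larger clusters.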

> The nodes are able to get
> their IP addresses via rarp, but when they try to connect to
> the master node (192.168.1.1:1555) to get the second level
> boot image, the slave nodes stall.

The leading causes of this are
  A network problem
       Switches set to forced-full-duplex won't work because there is no
         way to set driver parameters during boot
       Report the device driver version and detection message.
         The driver errata list is always changing with the introduction
	 of new, not-quite-compatible chips
  A version mismatch between the master and boot disks
      Due to changes in the Scyld boot protocol, the boot
      floppy/CD-ROM must match the master.

> When doing a netstat on the
> master node, it says an established tcp connection exists
> between .-1:1555 and .0:(some port). During this, no data is
> being transferred over the network, so I am sceptical whether the
> tcp connection actually exists.

Yes, netstat is accurately reporting the connection.  An established
connection indicates that at least a few packets got through.  That
reduces the likelihood of a device driver problem, but you might still
have a bogus switch configuration.  

> I am going to start looking into this, but I thought you
> might have a quick answer that would make me not have to
> dig through code and strace output all afternoon. =)

Using 'strace' likely won't be as useful as 'tcpdump'.  But just
monitoring network traffic with /proc/net/dev should give a good
indication of what is occurring.
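A minimal sketch of that kind of monitoring, in Python, might look like
the following.  It assumes the usual Linux /proc/net/dev layout (two
header lines, then one "name: counters" line per interface with eight
receive fields followed by eight transmit fields).  The helper names
parse_net_dev, delta, and watch are made up for this example.

```python
import time


def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)     # also handles "eth0:123" with no space
        fields = data.split()
        if len(fields) < 16:
            continue                        # unexpected layout, skip the line
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_packets": int(fields[1]),
            "rx_errs": int(fields[2]),
            "tx_bytes": int(fields[8]),
            "tx_packets": int(fields[9]),
            "tx_errs": int(fields[10]),
        }
    return stats


def delta(before, after):
    """Per-interface counter differences between two snapshots."""
    return {
        name: {key: after[name][key] - before[name][key] for key in after[name]}
        for name in after
        if name in before
    }


def watch(interface="eth0", interval=1.0, count=10):
    """Print how the counters of one interface change each interval."""
    with open("/proc/net/dev") as f:
        previous = parse_net_dev(f.read())
    for _ in range(count):
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            current = parse_net_dev(f.read())
        print(delta(previous, current).get(interface))
        previous = current
```

Calling watch("eth0") on the master while a node is stalled in boot
should show whether the packet counters are moving at all and whether
the error counters are climbing.  The interface name is an assumption;
substitute whichever interface faces the cluster network.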

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



