Marist Beowulf Setup
becker at scyld.com
Thu Nov 29 06:01:50 PST 2001
On Thu, 29 Nov 2001, Richard C Ferri wrote:
> Anthony Sofia <anthony at dryhump.net> on 11/28/2001 12:52:46 PM
> I have a couple of problems/questions that
> you might be able to help with. (This is all based on scyld)
> The first problem is the beoserv and bpmaster daemons are binding
> to -1 instead of an address (192.168.1.1).
The Scyld Beowulf system has special host names for cluster components.
    .0, .1 ...    Compute (slave) nodes
    .-1           Front-end (master) nodes
Note the leading ".", which makes this a hostname instead of a number.
This hostname syntax is a valid local text hostname for library
routines. It won't be misinterpreted as a valid Internet DNS hostname,
or an integer which would be interpreted as an IP number.
With this hostname form we can avoid the overhead or serialization of
hostname lookups by algorithmically translating to an IP address. We
parse the number and add it to the base IP address of the cluster nodes,
usually 192.168.1.100. (Implementation note: the correct netmask is
required for this to work with more than 154 hosts.)
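As a rough illustration of the translation described above, here is a
minimal Python sketch. It is not the Scyld implementation: the function
name node_name_to_ip and the /24 netmask are assumptions made for the
example, and only the base address 192.168.1.100 comes from the text.
How the master name .-1 maps to its address is not covered here, so the
sketch refuses it rather than guessing.

```python
import ipaddress

# The base address is the "usual" one quoted above. The /24 netmask is an
# assumption chosen only to illustrate the netmask limit.
NODE_BASE = ipaddress.IPv4Address("192.168.1.100")
CLUSTER_NET = ipaddress.IPv4Network("192.168.1.0/24")


def node_name_to_ip(name, base=NODE_BASE, network=CLUSTER_NET):
    """Translate a compute-node name such as ".0" or ".12" to an address."""
    if not name.startswith("."):
        raise ValueError("not a cluster node name: %r" % name)
    number = int(name[1:])      # raises ValueError for non-numeric text
    if number < 0:
        # ".-1" names the master; its mapping is not specified in the text.
        raise ValueError("master names are not handled by this sketch")
    address = base + number     # IPv4Address supports integer addition
    if address not in network or address == network.broadcast_address:
        raise ValueError("node %d falls outside %s" % (number, network))
    return address
```

With these assumed values, node_name_to_ip(".0") yields 192.168.1.100,
and names past .154 are rejected because they run off the end of the /24.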
> The nodes are able to get
> their IP addresses via rarp, but when they try to connect to
> the master node (192.168.1.1:1555) to get the second level
> boot image, the slave nodes stall.
The leading causes of this are
   A network problem
      Switches set to forced-full-duplex won't work because there is no
      way to set driver parameters during boot.
      Report the device driver version and detection message.
      The driver errata list is always changing with the introduction
      of new, not-quite-compatible chips.
   A version mismatch between the master and boot disks
      Due to changes in the Scyld boot protocol, the boot
      floppy/CD-ROM must match the master.
> When doing a netstat on the
> master node, it says an established tcp connection exists
> between .-1:1555 and .0:(some port). During this, no data is
> being transferred over the network, so I am sceptical if the
> tcp connection actually exists.
Yes, netstat is accurately reporting the connection. An established
connection indicates that at least a few packets got through. That
reduces the likelihood of a device driver problem, but you might still
have a bogus switch configuration.
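On Linux, netstat takes this information from /proc/net/tcp, so the same
check can be sketched directly. The following Python is only an
illustration: the function name established_on_port is hypothetical, and
it assumes the usual /proc/net/tcp layout (hex "address:port" pairs, a
hex state column where 01 means ESTABLISHED, then hex
"tx_queue:rx_queue"). Port 1555 is the one mentioned in the question.

```python
ESTABLISHED = 0x01  # state code used for ESTABLISHED in /proc/net/tcp


def established_on_port(text, port=1555):
    """Return (local_port, remote_port, tx_queue, rx_queue) tuples for
    established connections whose local port matches `port`."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a slot index such as "0:"; the header does not.
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        tx_queue, rx_queue = (int(v, 16) for v in fields[4].split(":"))
        if state == ESTABLISHED and local_port == port:
            found.append((local_port, remote_port, tx_queue, rx_queue))
    return found
```

On the master this could be called as
established_on_port(open("/proc/net/tcp").read()). A send queue that
stays non-zero across repeated readings would be consistent with data
not getting through to the slave.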
> I am going to start looking into this, but I thought you
> might have a quick answer that would make me not have to
> dig through code and strace output all afternoon. =)
Using 'strace' likely won't be as useful as 'tcpdump'. But just
monitoring network traffic with /proc/net/dev should give a good
indication of what is occurring.
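A small sketch of that kind of monitoring, in Python, might look like
the following. The function names are made up for the example, and it
assumes the standard /proc/net/dev layout: two header lines, then one
line per interface with eight receive counters followed by eight
transmit counters.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_bytes, rx_packets, tx_bytes, tx_packets)}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:          # header lines contain no colon
            continue
        # Split on the colon first: some kernels print "eth0:1234"
        # with no space between the name and the first counter.
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 10:
            continue
        counters[name.strip()] = (int(fields[0]), int(fields[1]),
                                  int(fields[8]), int(fields[9]))
    return counters


def traffic_delta(before, after):
    """Per-interface counter increase between two parse_net_dev() results."""
    return {name: tuple(b - a for a, b in zip(before[name], after[name]))
            for name in after if name in before}


def watch(interval=5.0, path="/proc/net/dev"):
    """Print counter increases over one sampling interval (Linux only)."""
    with open(path) as f:
        before = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())
    for name, delta in sorted(traffic_delta(before, after).items()):
        print(name, "rx bytes/packets, tx bytes/packets:", delta)
```

Calling watch() on the master while a slave is stalled shows whether the
packet counters on the cluster interface are moving at all.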
Donald Becker                   becker at scyld.com
Scyld Computing Corporation     http://www.scyld.com
410 Severn Ave. Suite 210       Second Generation Beowulf Clusters
Annapolis MD 21403              410-990-9993