Marist Beowulf Setup
Donald Becker becker at scyld.com
Thu Nov 29 06:01:50 PST 2001
On Thu, 29 Nov 2001, Richard C Ferri wrote:

> Anthony Sofia <anthony at dryhump.net> on 11/28/2001 12:52:46 PM
>
> I have a couple of problems/questions that
> you might be able to help with. (This is all based on scyld)
>
> The first problem is the beoserv and bpmaster daemons are binding
> to -1 instead of an address (192.168.1.1).

The Scyld Beowulf system has special host names for cluster components.

    .0, .1 ...   Compute (slave) nodes
    .-1          Front-end (master) nodes

Note the leading ".", which makes this a hostname instead of a number.

This hostname syntax is a valid local text hostname for library
routines. It won't be misinterpreted as a valid Internet DNS hostname,
or as an integer which would be interpreted as an IP number.

With this hostname form we can avoid the overhead or serialization of
hostname lookups by algorithmically translating to an IP address. We
parse the number and add it to the base IP address of the cluster
nodes, usually 192.168.1.100. (Implementation note: the correct
netmask is required for this to work with more than 154 hosts.)

> The nodes are able to get
> their IP addresses via rarp, but when a node tries to connect to
> the master node (192.168.1.1:1555) to get the second level
> boot image, the slave node stalls.

The leading causes of this are
   A network problem
      Switches set to forced-full-duplex won't work because there is
      no way to set driver parameters during boot.
      Report the device driver version and detection message. The
      driver errata list is always changing with the introduction of
      new, not-quite-compatible chips.
   A version mismatch between the master and boot disks
      Due to changes in the Scyld boot protocol, the boot
      floppy/CD-ROM must match the master.

> When doing a netstat on the
> master node, it says an established tcp connection exists
> between .-1:1555 and .0:(some port). During this, no data is
> being transferred over the network, so I am sceptical if the
> tcp connection actually exists.

Yes, netstat is accurately reporting the connection. An established
connection indicates that at least a few packets got through. That
reduces the likelihood of a device driver problem, but you might still
have a bogus switch configuration.

> I am going to start looking into this, but I thought you
> might have a quick answer that would make me not have to
> dig through code and strace output all afternoon. =)

Using 'strace' likely won't be as useful as 'tcpdump'. But just
monitoring network traffic with /proc/net/dev should give a good
indication of what is occurring.

Donald Becker                      becker at scyld.com
Scyld Computing Corporation        http://www.scyld.com
410 Severn Ave. Suite 210          Second Generation Beowulf Clusters
Annapolis MD 21403                 410-990-9993
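
The hostname translation described in the message (parse the number
after the leading "." and add it to the base address of the compute
nodes) can be illustrated with a minimal Python sketch. This is not
Scyld's code. The function name node_address, its parameters, the
special-casing of ".-1" to 192.168.1.1 (taken from the addresses
mentioned in the thread), and the netmask check are all assumptions
made for illustration.

```python
import ipaddress


def node_address(hostname, base="192.168.1.100", master="192.168.1.1",
                 network="192.168.1.0/24"):
    """Translate a '.N' cluster hostname to an IPv4 address.

    Hypothetical helper: the defaults mirror the addresses quoted in
    the thread, not any documented Scyld configuration.
    """
    if not hostname.startswith("."):
        raise ValueError(f"not a cluster hostname: {hostname!r}")

    number = int(hostname[1:])  # ".-1" -> -1, ".0" -> 0, ".17" -> 17

    # Assumption: the master (".-1") has its own fixed address rather
    # than base + (-1).
    if number == -1:
        return ipaddress.IPv4Address(master)
    if number < 0:
        raise ValueError(f"invalid node number: {number}")

    # Compute nodes: add the node number to the base address.
    address = ipaddress.IPv4Address(base) + number

    # The result must still be a usable host address inside the cluster
    # network, which is why the netmask matters for large node counts.
    net = ipaddress.ip_network(network)
    if address not in net or address == net.broadcast_address:
        raise ValueError(f"{hostname} maps outside {net}")
    return address


if __name__ == "__main__":
    print(node_address(".-1"))  # master
    print(node_address(".0"))   # first compute node
    print(node_address(".154", network="192.168.1.0/24"))
    print(node_address(".156", network="192.168.0.0/16"))
```

With the default /24 network the sketch rejects ".155", because base
plus 155 is the broadcast address 192.168.1.255. A wider network such
as 192.168.0.0/16 lets higher node numbers carry into the next octet.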
