Linda problems (under work w/G98)

Mikhail Kuzminsky kus at free.net
Fri Nov 22 11:35:22 PST 2002


I've installed the binary version of Linda 6.2 (for homogeneous clusters)
on our Gigabit Ethernet-based cluster
(the nodes run RH 7.2). The main task of Linda for us
is to support inter-node parallelization of one application
(binary version of Gaussian 98 Rev.A11).
But we found that this application starts parallel processes
on the cluster nodes and then hangs because of Linda/network problems
(it looks like the problem is not w/G98 itself).

I would greatly appreciate any ideas about what the
real source of our problem may be!
 
A bit more detailed description of our situation follows.

1) We tested G98+Linda on 2 identical SMP nodes w/the default
Linda configuration file, i.e. w/Tsnet.Appl.maxprocspernode: 1
(so Linda starts 1 master process on the master node and
1 additional process on the 2nd node). The clocks on both
nodes are synchronized through ntpd. NFS is not used.

2) Both nodes have identical .tsnet.config files in the home directory
of the same user. These files contain 1 line:

Tsnet.Appl.nodelist: host1 host2

3) When g98l (the application executable) is started on host1,
we see the following ntsnet messages:
...
ntsnet starting master process on host1
ntsnet starting 1 worker on host2
ntsnet waiting for Linda group messages 
ntsnet received Linda group message: group has 2 members

... and now we see parallel processes running on both nodes,
but it looks like they can't exchange (send/receive) messages: they are
mostly in a waiting state, and strace shows
- on host1 (master): select/gettimeofday/sendto/recvfrom syscalls in a loop (the last one failing w/"Resource temporarily unavailable")
- on host2: select/gettimeofday/sendto syscalls in a loop
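
The pattern on host1 is what a non-blocking receive loop looks like when
no datagram ever arrives: select returns, recvfrom fails w/EAGAIN
("Resource temporarily unavailable"), and the loop goes around again.
A minimal Python sketch of that behaviour follows. It is not Linda's code;
it assumes datagram (UDP) sockets, and the names poll_once and demo are
purely illustrative.

```python
import errno
import select
import socket


def poll_once(sock, timeout=0.1):
    """Wait up to `timeout` seconds, then try one non-blocking read.

    Returns the datagram bytes, or None if nothing had arrived.
    """
    select.select([sock], [], [], timeout)
    try:
        data, _addr = sock.recvfrom(65535)
        return data
    except BlockingIOError as exc:
        # EAGAIN/EWOULDBLOCK: strace prints this errno as
        # "Resource temporarily unavailable".
        assert exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        return None


def demo():
    """Loopback-only illustration: one empty poll, then one successful poll."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))      # any free port on loopback
    receiver.setblocking(False)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first = poll_once(receiver)      # nothing has been sent yet
        sender.sendto(b"hello", receiver.getsockname())
        second = poll_once(receiver, timeout=1.0)
    finally:
        sender.close()
        receiver.close()
    return first, second
```

In this sketch the first poll corresponds to what host1 keeps seeing: the
receiver is alive and polling, but nothing addressed to it ever arrives.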

After some time we see the following message on host1:
ntsnet: worker on node host2 exited abnormally
- and the run ends.

4) When g98l is started on host2 (i.e. host2 is now the master node),
the situation is different (not symmetrical):

ntsnet starting master process on host2
ntsnet starting 1 worker on host1
ntsnet waiting for Linda group message
Linda Error: node host1(0) warning: sendto failed: Network is unreachable
ntsnet received Linda group message: group has 2 members
... and then a lot of Linda error messages saying that the network is unreachable.

And, as in the previous case, we see parallel (waiting) processes on both nodes.
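
"Network is unreachable" is errno ENETUNREACH, which the kernel returns
from sendto when it has no route for the destination address. One way to
check this from each node is to ask the kernel which source address it
would pick for a given peer. The sketch below is only an assumption-laden
probe: route_probe is a hypothetical helper, port 9 is an arbitrary choice,
and the address Linda actually passes to sendto is not shown in the
messages above, so this tests only the plain host address.

```python
import errno
import socket


def route_probe(peer, port=9):
    """Return (source_ip, None) if the kernel has a route to `peer`,
    otherwise (None, error_name).

    connect() on a UDP socket sends no packet; it only makes the kernel
    do the route lookup and fix the local source address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((peer, port))
        return sock.getsockname()[0], None
    except OSError as exc:
        # e.g. "ENETUNREACH" when no route exists; name-resolution
        # failures fall back to the exception text.
        return None, errno.errorcode.get(exc.errno, str(exc))
    finally:
        sock.close()
```

Running route_probe("host2") on host1 and route_probe("host1") on host2
would show whether the two directions differ, and which local address each
node would send from.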

5) While the parallel processes on both nodes fail to
"negotiate", ping and rsh between these nodes work
normally. Ping gives varying delays for host1-->host2 and host2-->host1
(90-130 microseconds), which looks reasonable. Ifconfig
reports no network errors.
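
Note that ping uses ICMP and rsh uses TCP, so neither exercises the
sendto/recvfrom datagram path seen in strace. Assuming Linda talks over
UDP (not confirmed), a direct UDP round trip between the nodes is a closer
test. A minimal sketch follows; the function names and the choice of port
are illustrative only.

```python
import socket
import threading


def udp_echo_server(sock, count=1):
    """Echo `count` datagrams back to whoever sent them."""
    for _ in range(count):
        data, addr = sock.recvfrom(65535)
        sock.sendto(data, addr)


def udp_roundtrip(host, port, payload=b"ping", timeout=2.0):
    """Send one datagram and wait for the echo.

    Returns True only if the same payload comes back in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            reply, _addr = s.recvfrom(65535)
        except OSError:
            # Covers timeouts as well as errors such as ENETUNREACH.
            return False
        return reply == payload


def loopback_demo():
    """Self-test on 127.0.0.1; on the cluster the server would run on
    one node and udp_roundtrip on the other."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_server, args=(server,), daemon=True)
    worker.start()
    ok = udp_roundtrip("127.0.0.1", port)
    worker.join(timeout=2.0)
    server.close()
    return ok
```

For a real check the server side would bind to the node's cluster-facing
address rather than 127.0.0.1, and the test would be run in both directions.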

Yours
Mikhail Kuzminsky
Zelinsky Inst. of Organic Chemistry
Moscow


 



