[Beowulf] TCP connect error: ECONNREFUSED.
David Simas
dgs at slac.stanford.edu
Tue Mar 31 10:57:03 PDT 2009
On Mon, Mar 30, 2009 at 02:14:50PM +0100, Jörg Saßmannshausen wrote:
> Dear all,
>
> I am having this rather annoying problem with the parallel execution of
> one of the programs (GAMESS US version) on our cluster. The error
> message is:
>
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 0.
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 60.
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 2.
> TCP connect error: ECONNREFUSED.
>
> [ ... ]
>
> Eventually, the ddikick tips over and the whole thing crashes. The
> program is using rsh (yes, I know, security, I did not install the
> cluster!) and I can rsh comp10 -> comp02 and there is no firewall
> installed between the nodes (at least, not that I am aware of). Trying
> to run the same job with the same number of nodes will fail X times and
> at X+1 suddenly work. I could not work out a pattern for that (other
> than that I get exponentially annoyed). Right now, there is only one gigabit
> network connecting the cluster, so NFS, MPI, etc. all run over one
> interface (again, I did not set up the cluster).
How rapidly are these rsh connection attempts occurring? The rsh protocol
requires connections from privileged ports, i.e. ports below 1024. If a host
attempts to make more than 1024 connections to another host within the TCP
TIME-WAIT interval, it will run out of ports and the connections will fail.
I've seen this occur with parallel applications using rsh.
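
To put rough numbers on that, here is a minimal Python sketch. It is only an
illustration: the figures of 512 usable source ports and a 60 second TIME-WAIT
are assumptions (typical for a Linux/glibc rsh client, as far as I know), not
values measured on this cluster, and both function names are made up for this
example. The first function estimates how many connections per second one host
can open to a single peer before its privileged ports are all tied up; the
second counts TIME_WAIT sockets with a privileged local port in saved
`netstat -tan` output, assuming the usual Linux column layout.

```python
# Back-of-the-envelope check for privileged-port exhaustion with rsh.
# All numbers used here are assumptions, not measurements from the cluster.


def max_sustained_rate(usable_ports: int, time_wait_s: float) -> float:
    """Connections per second one host can open to a single peer before
    every usable source port is held in TIME-WAIT."""
    if usable_ports <= 0 or time_wait_s <= 0:
        raise ValueError("usable_ports and time_wait_s must be positive")
    return usable_ports / time_wait_s


def count_privileged_time_wait(netstat_text: str) -> int:
    """Count TIME_WAIT sockets whose local port is below 1024.

    Expects text in the Linux `netstat -tan` layout (an assumption):
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # The local address looks like "10.0.0.10:1022"; take the port part.
        port = fields[3].rsplit(":", 1)[-1]
        if port.isdigit() and int(port) < 1024:
            count += 1
    return count


if __name__ == "__main__":
    # Assumed: 512 usable ports (512-1023) and a 60 s TIME-WAIT.
    print(f"{max_sustained_rate(512, 60.0):.1f} connections/s")
```

Under those assumptions the limit is well below ten connections per second to
any one peer, which a parallel job starting many processes at once could
plausibly exceed. Running the counter on the connecting host while a job
starts would show whether the privileged range is actually filling up.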
David S.
>
> I have run out of ideas of where to look. I checked (as quickly as
> possible) at some nodes with netstat; the ddikick program is actually
> running. Changing to ssh did not solve the problem.
>
> I would appreciate any feedback as it is highly annoying to wait Y days
> to get the job running and then it crashes.
>
> All the best from Glasgow!
>
> Jörg
>
>
> --
> *************************************************************
> Jörg Saßmannshausen
> Research Fellow
> University of Strathclyde
> Department of Pure and Applied Chemistry
> 295 Cathedral St.
> Glasgow
> G1 1XL
>
> email: jorg.sassmannshausen at strath.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf