[Beowulf] TCP connect error: ECONNREFUSED.

Jörg Saßmannshausen jorg.sassmannshausen at strath.ac.uk
Mon Mar 30 06:14:50 PDT 2009


Dear all,

I am having a rather annoying problem with the parallel execution of 
one of the programs (GAMESS US version) on our cluster. The error 
message is:

  TCP connect error: ECONNREFUSED.
  TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
  A fatal error occurred on DDI Process 0.
  TCP connect error: ECONNREFUSED.
  TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
  A fatal error occurred on DDI Process 60.
  TCP connect error: ECONNREFUSED.
  TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
  A fatal error occurred on DDI Process 2.
  TCP connect error: ECONNREFUSED.

[ ... ]

Eventually, ddikick tips over and the whole thing crashes. The 
program is using rsh (yes, I know, security, I did not install the 
cluster!) and I can rsh from comp10 to comp02. There is no firewall 
installed between the nodes (at least, not that I am aware of). Running 
the same job with the same number of nodes fails X times and then 
suddenly works on attempt X+1. I could not work out a pattern for that 
(other than that I get exponentially annoyed). Right now, there is only 
one gigabit network connecting the cluster, so NFS, MPI etc. are all 
running over one interface (again, I did not set up the cluster).

I have run out of ideas of where to look. I checked some nodes with 
netstat (as quickly as possible), and the ddikick program is actually 
running. Changing to ssh did not solve the problem.
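
One way to put numbers on the "fails X times, then works" behaviour is to 
probe the target port repeatedly from the failing node and count how often 
the connection is refused. The following is only a minimal sketch, not part 
of GAMESS or DDI: the names probe_once and probe_many are hypothetical, and 
the host and port have to be taken from the current error output, since the 
port (42208 above) will probably differ from run to run.

```python
import errno
import socket
import sys
import time
from collections import Counter


def probe_once(host, port, timeout=2.0):
    """Attempt one TCP connect; return 'OK', 'TIMEOUT' or the errno name."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except socket.timeout:
        return "TIMEOUT"
    except OSError as exc:
        # e.g. ECONNREFUSED when nothing is listening on the port
        return errno.errorcode.get(exc.errno, "UNKNOWN")


def probe_many(host, port, attempts=20, delay=0.5, timeout=2.0):
    """Repeat the probe and tally the outcomes."""
    results = Counter()
    for _ in range(attempts):
        results[probe_once(host, port, timeout)] += 1
        time.sleep(delay)
    return results


# Hypothetical usage: python probe.py comp02.chem.strath.ac.uk 42208
if __name__ == "__main__" and len(sys.argv) == 3:
    print(dict(probe_many(sys.argv[1], int(sys.argv[2]))))
```

A tally that mixes OK and ECONNREFUSED for the same port would suggest that 
the listener is not always up (or its backlog is full) when the other 
processes try to connect, rather than a routing or firewall problem.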

I would appreciate any feedback, as it is highly annoying to wait Y days 
for the job to start, only to have it crash.

All the best from Glasgow!

Jörg


-- 
*************************************************************
Jörg Saßmannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html





