[Beowulf] Re: TCP connect error: ECONNREFUSED. - solved-
Jörg Saßmannshausen
jorg.sassmannshausen at strath.ac.uk
Wed Apr 15 02:24:25 PDT 2009
Dear all,
some time ago I contacted the list regarding the above problem.
I would like to thank all who contributed towards the solution. I have
finally found out what is going on.
The problem lies in the hostlist (the list of nodes the job is going to
run on, i.e. the machinefile if you like), and in particular in its
order.
PBS-type schedulers (I have used TORQUE before) provide the
$PBS_NODEFILE, which you only need to read out. I searched the
internet for something similar, but all I could find was that SGE
apparently writes the hostlist out to a file, so I used that. What I was
not aware of at the time is that the vendor has its own script to
generate that file and (unfortunately for me) that script contains the
command 'sort', so the order of the nodes gets changed. However, ddikick,
the program which does the parallelisation, seems to be quite fussy
about the order, as the first node in the list becomes the master, which
initiates all the other processes. As the order differs from what SGE
supplied, this leads to the bizarre situation that SGE starts the
process on a 'slave' (from ddikick's point of view), and hence
ddikick-master and SGE-master never speak to each other. The
solution was to use the $PE_HOSTFILE and read the nodes out from there,
the same as I do with the $PBS_NODEFILE. It could not have been any
easier _if_ I had known beforehand.
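For anyone who wants to see the idea in code, here is a minimal sketch of
reading the nodes out of the $PE_HOSTFILE without disturbing their order.
It is not the script I actually use. It assumes the usual SGE layout of
one line per host ("hostname slots queue processor-range"), and the
function names parse_pe_hostfile and hostlist_in_sge_order are made up
for illustration. How the resulting list is handed to ddikick is left
out, as that depends on the local setup.

```python
import os


def parse_pe_hostfile(text):
    """Return [(hostname, slots), ...] in the order SGE listed them.

    Assumes each line looks like: "hostname slots queue processor-range".
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def hostlist_in_sge_order(text):
    """Return only the hostnames, deliberately NOT sorted.

    The first entry is the node SGE starts the job on, so it has to
    stay first if the parallel launcher treats it as the master.
    """
    return [host for host, _slots in parse_pe_hostfile(text)]


if __name__ == "__main__":
    # Inside an SGE parallel job, PE_HOSTFILE holds the path to the file.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as fh:
            for host in hostlist_in_sge_order(fh.read()):
                print(host)
```

With a hostfile that lists node12, node03 and node07 in that order, this
keeps node12 first, whereas passing the list through 'sort' would put
node03 first and so name a different master than the node SGE started
the job on.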
I thought I would share this with you, in case somebody searches the
list and finds my thread. :-)
All the best from Glasgow!
Jörg
--
*************************************************************
Jörg Saßmannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL
email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html