[Beowulf] LAM trouble
Jeffrey B. Layton
laytonjb at charter.net
Tue Apr 11 15:14:26 PDT 2006
Howdy!
I apologize for posting this problem here, but I tried the LAM
list and didn't hear anything, so I thought I would cast my net
a bit wider in search of help.
I'm having trouble starting an MPI code (NPB bt) that was
built with PGI 6.1 and LAM-7.1.2. I get the following messages
when I try to start the code (lamboot output first, then the error):
n-1<24201> ssi:boot:base:linear: booting n0 (n2004)
n-1<24201> ssi:boot:base:linear: booting n1 (n2005)
n-1<24201> ssi:boot:base:linear: booting n2 (n2006)
n-1<24201> ssi:boot:base:linear: booting n3 (n2007)
n-1<24201> ssi:boot:base:linear: booting n4 (n2008)
n-1<24201> ssi:boot:base:linear: booting n5 (n2009)
n-1<24201> ssi:boot:base:linear: booting n6 (n2010)
n-1<24201> ssi:boot:base:linear: booting n7 (n2011)
n-1<24201> ssi:boot:base:linear: finished
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun chose a different RPI than its peers. For example, at least
the following two processes mismatched in their RPI selections:
MPI_COMM_WORLD rank 0: tcp (v7.1.0)
MPI_COMM_WORLD rank 3: usysv (v7.1.0)
All MPI processes must choose the same RPI module and version when
they start. Check your SSI settings and/or the local environment
variables on each node.
I'm using PBS to start the job and here are the relevant parts
of the script:
NET=tcp
lamboot -b -v -ssh rpi $NET $PBS_NODEFILE
mpirun -O -v C ./${EXE} >> ${OUTFILE}
lamhalt
where $EXE and $OUTFILE are defined appropriately in the
script.
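For reference, here is a rough sketch of how the RPI could be pinned explicitly at mpirun time rather than at lamboot time. It assumes LAM's `-ssi rpi <module>` option to mpirun and the `LAM_MPI_SSI_rpi` environment variable behave as described in the LAM 7.1 documentation. The names `bt.A.8` and `MPIRUN_CMD` are placeholders for illustration only and are not from my actual script. The command is printed instead of executed, so the sketch can be tried without a running LAM universe.

```shell
#!/bin/sh
# Sketch only: request the same RPI module for every MPI rank.
NET=tcp          # same variable as in the script above
EXE=bt.A.8       # hypothetical executable name, for illustration

# Option 1 (assumption): set the SSI parameter through the environment.
# LAM's mpirun is documented to export LAM_MPI_* variables to all ranks.
LAM_MPI_SSI_rpi=$NET
export LAM_MPI_SSI_rpi

# Option 2 (assumption): pass the SSI parameter directly to mpirun.
MPIRUN_CMD="mpirun -ssi rpi $NET -O -v C ./$EXE"

# Dry run: show the command that would be used in the PBS script.
echo "$MPIRUN_CMD"
```

Either option alone should be enough if the mismatch comes from the RPI never being requested at mpirun time; whether that is really the cause here is a guess.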
Does anyone have any ideas?
TIA!
Jeff