[Beowulf] BLACS Errors?
Ashton Peters
ape20 at student.canterbury.ac.nz
Wed Aug 4 19:14:19 PDT 2004
I am having trouble with BLACS calls within a very simple Fortran 90
program on a ten-node dual-Opteron Rocks Linux 3.2.0 cluster. We have
the PGI CDK 5.1 installed.
I have written a simple Fortran program to test broadcast sends and
receives using BLACS. The full code of this program is appended at the
end of this message.
I compile this code with:
$ pgf90 -Mscalapack -o simple.opt simple.f
... and run it with:
$ mpirun -np X simple.opt
The code will run fine with 2 or 3 processors, with any vector length
(n) I choose. Below is the screen output from a successful 3 processor
run:
[ape20 at colossus fwdsolvers]$ pgf90 -Mscalapack -o simple.opt simple.f
[ape20 at colossus fwdsolvers]$ mpirun -np 3 simple.opt
ape20 at compute-0-0's password:
ape20 at compute-0-1's password:
Process 0 is alive at grid position (0,0)
For this test n =1000
Array sent from process 0
Process 1 is alive at grid position (0,1)
Array received at process 1
Process 2 is alive at grid position (0,2)
Array received at process 2
[ape20 at colossus fwdsolvers]$
However, if I try to run with -np 4 or greater, I get the following
screen output:
[ape20 at colossus fwdsolvers]$ pgf90 -Mscalapack -o simple.opt simple.f
[ape20 at colossus fwdsolvers]$ mpirun -np 4 simple.opt
ape20 at compute-0-0's password:
ape20 at compute-0-1's password:
ape20 at compute-0-2's password:
Process 0 is alive at grid position (0,0)
For this test n =1000
Array sent from process 0
Process 2 is alive at grid position (0,2)
Array received at process 2
Process 1 is alive at grid position (0,1)
Array received at process 1
bm_list_28551: (7.738281) wakeup_slave: unable to interrupt slave 0 pid 28550
Received disconnect from 10.255.255.252: Command terminated on signal 13.
[ape20 at colossus fwdsolvers]$ rm_l_1_19376: (5.019531) net_send: could not write to fd=6, errno = 9
rm_l_1_19376: p4_error: net_send write: -1
p4_error: latest msg from perror: Bad file descriptor
rm_l_2_10837: (2.453125) net_send: could not write to fd=6, errno = 9
rm_l_2_10837: p4_error: net_send write: -1
p4_error: latest msg from perror: Bad file descriptor
[ape20 at colossus fwdsolvers]$
Does anyone have an idea what these error messages mean, and how I can
fix them? I am a beginner with BLACS, so it is possible that my Fortran
code has not initialized it correctly, but I have checked it against
many tutorial examples and it seems OK.
Many thanks in advance,
Ashton Peters
Center for Bioengineering
University of Canterbury
Christchurch, New Zealand
----- FORTRAN CODE -----
      program SIMPLE
ccccc VERY SIMPLE BLACS TEST PROGRAM ccccc
ccccc Declare variables
      integer iam,nprocs,nprows,npcols,ctxt,myprow,mypcol
      integer junk(5000)
ccccc Total number of processes
      call BLACS_PINFO(iam,nprocs)
ccccc Define size of process grid (in this case a single row)
      nprows=1
      npcols=nprocs
ccccc Get the system context
      call BLACS_GET(0,0,ctxt)
ccccc Initialise the process grid
      call BLACS_GRIDINIT(ctxt,'Row',nprows,npcols)
      call BLACS_GRIDINFO(ctxt,nprows,npcols,myprow,mypcol)
ccccc Get each process to check in with grid coordinates
 10   format(a8,i2,a28,i1,a1,i1,a1)
      print 10,'Process',iam,
     & 'is alive at grid position (',myprow,',',mypcol,')'
ccccc Master generates integer array and broadcasts to all slaves
      if((myprow.eq.0).and.(mypcol.eq.0)) then
        n=1000
        call IGEBS2D(ctxt,'All',' ',1,1,n,1)
 20     format(a18,i4)
        print 20,'For this test n =',n
        do i=1,n
          junk(i)=i
        enddo
        call IGEBS2D(ctxt,'All',' ',n,1,junk,5000)
        print 30,'Array sent from process ',iam
ccccc End master code
      endif
ccccc Slaves receive info and check it is correct
      if((myprow.ne.0).or.(mypcol.ne.0)) then
        call IGEBR2D(ctxt,'All',' ',1,1,n,1,0,0)
        call IGEBR2D(ctxt,'All',' ',n,1,junk,2500,0,0)
 30     format(a27,i2)
        if((junk(1).eq.1).and.(junk(n).eq.n)) then
          print 30,'Array received at process ',iam
        else
          print 30,'Error receiving at process',iam
        endif
ccccc End slave code
      endif
ccccc End program
      end
----- END OF FORTRAN CODE -----
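
For comparison with the listing above: BLACS example programs typically
finish by releasing the process grid and shutting BLACS down, which the
program above does not do. The sketch below shows that shutdown sequence
in isolation. It is an assumption, not a confirmed diagnosis, that the
missing shutdown is related to the p4 errors. The program name SHUTDN is
made up for illustration, and BLACS_GRIDEXIT and BLACS_EXIT are standard
BLACS routines that do not appear in the original code.

```fortran
      program SHUTDN
c     Hypothetical minimal example (not the original test program):
c     set up a 1 x nprocs grid, then release it and shut BLACS down
c     before the program ends.
      integer iam,nprocs,nprows,npcols,ctxt,myprow,mypcol
      call BLACS_PINFO(iam,nprocs)
      nprows=1
      npcols=nprocs
      call BLACS_GET(0,0,ctxt)
      call BLACS_GRIDINIT(ctxt,'Row',nprows,npcols)
      call BLACS_GRIDINFO(ctxt,nprows,npcols,myprow,mypcol)
c     ... the broadcast sends and receives would go here ...
c     Release the grid context created by BLACS_GRIDINIT
      call BLACS_GRIDEXIT(ctxt)
c     An argument of 0 asks BLACS to also shut down the underlying
c     message passing layer; every process must reach this call
      call BLACS_EXIT(0)
      end
```

This would be compiled and launched in the same way as the test program
(pgf90 -Mscalapack, then mpirun).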