[Beowulf] mpich2 complain about nodes that i dont use
r.li at qmul.ac.uk
Fri Sep 30 06:58:42 PDT 2005
I am using mpich2 on a Linux cluster, and I keep getting errors like the following:
rank 14 in job 2 cn128_57798 caused collective abort of all ranks
exit status of rank 14: killed by signal 9
mpdrun_cn145: cannot connect to local mpd (/tmp/mpd2.console_lrz); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
There are 160 nodes on the cluster. I used "mpdboot -n -f" to start MPI. Because there were always errors when I tried to boot every node, I defined only 64 nodes in the mpd.hosts file. The nodes named in the errors above are not in the mpd.hosts file, nor in the mpiexec command I use to run my application.
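One way to double-check that the nodes in the error output really are absent from mpd.hosts could be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, node names are assumed to look like "cn" followed by digits (as in cn128 and cn145 above), and each mpd.hosts line is assumed to hold one short host name, optionally followed by ":ncpus".

```python
import re

# Assumption: compute nodes are named "cn<digits>", as in the error output.
# The lookbehind lets the pattern match inside "mpdrun_cn145" as well.
HOST_RE = re.compile(r"(?<![A-Za-z0-9])cn\d+")


def hosts_in_log(log_text):
    """Return the set of node names mentioned anywhere in the error output."""
    return set(HOST_RE.findall(log_text))


def read_hosts_file(hosts_text):
    """Return the set of host names listed in the text of an mpd.hosts file."""
    hosts = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            # Assumption: an entry may look like "host:ncpus"; keep the host part.
            hosts.add(line.split()[0].split(":", 1)[0])
    return hosts


def unexpected_hosts(log_text, hosts_text):
    """Return nodes that appear in the errors but not in mpd.hosts, sorted."""
    return sorted(hosts_in_log(log_text) - read_hosts_file(hosts_text))
```

Calling `unexpected_hosts(open("errors.log").read(), open("mpd.hosts").read())` (both file names are placeholders) would list any node that shows up in the errors without being in the hosts file. The output of `mpdtrace`, which lists the hosts in the running ring, could be compared in the same way.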
Does anybody have any experience with this? Thanks a lot!