[Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all
M J Harvey m.j.harvey at imperial.ac.uk
Wed Oct 4 09:22:46 PDT 2006
Hello,

> We are going through a similar experience at one of our customer sites.
> They are trying to run Intel MPI on more than 1,000 nodes. Are you
> experiencing problems starting the MPD ring? We noticed it takes a
> really long time especially when the node count is large. It also just
> doesn't work sometimes.

I've had similar problems with slow and unreliable startup of the Intel mpd ring.

I noticed that before spawning the individual mpds, it connects to each node and checks the version of the installed Python (function getversionpython() in mpdboot.py). On my cluster, at least, this check was very slow (not to say pointless). Removing it dramatically improved startup time - now it's merely slow.

Also, for jobs with large process counts, it's worth increasing recvTimeout in mpirun from 20 seconds. This value governs the amount of time mpirun waits for the secondary MPI processes to be spawned by the remote mpds, and the default value is much too aggressive for large jobs started via ssh.

Kind Regards,
Matt
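To illustrate why a per-node check like the one described above can dominate startup, here is a minimal Python sketch. It is not taken from mpdboot.py: check_node() is a hypothetical stand-in for one remote round trip (it only sleeps), and the sketch assumes the checks run one after another, which the post does not state. The 0.5 second per-node figure in the usage note is likewise only an assumed number.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def check_node(host, latency=0.01):
    """Hypothetical stand-in for one remote check (e.g. an ssh round trip)."""
    time.sleep(latency)  # simulate the network/login delay
    return host, "ok"


def check_serial(hosts, latency=0.01):
    """Visit nodes one at a time; total time grows linearly with node count."""
    return [check_node(h, latency) for h in hosts]


def check_concurrent(hosts, latency=0.01, workers=32):
    """Overlap the waits with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: check_node(h, latency), hosts))


def serial_cost_seconds(num_nodes, per_node_seconds):
    """Back-of-envelope cost of a strictly serial per-node check."""
    return num_nodes * per_node_seconds
```

Under those assumptions, serial_cost_seconds(1000, 0.5) gives 500 seconds spent on the check alone for a 1,000 node ring, which is why removing it (or overlapping the connections, as check_concurrent does) can make a large difference.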
