[Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all
William Gropp gropp at mcs.anl.gov
Thu Oct 5 11:59:33 PDT 2006
If you have a batch system that can start the MPDs, you should consider
starting the MPI processes directly with the batch system and providing
a separate service to provide the startup information.

In MPICH2, the MPI implementation is separated from the process
management. The mpd system is simply an example process manager (albeit
one with many useful features). We didn't expect users to use one
existing parallel process management system to start another one;
instead, we expected that those existing systems would use the PMI
interface used in MPICH2 to directly start the MPI processes. I know
that you don't need MPD for MPICH2; I expect the same is true for Intel
MPI.

Bill

On Oct 4, 2006, at 11:31 AM, Bill Bryce wrote:

> Hi Matt,
>
> You pretty much diagnosed our problem correctly. After discussing with
> the customer and a few more engineers here we found that the python
> code was very slow at starting the ring. Seems to be a common problem
> with MPD startup on other MPI implementations as well (I could be
> wrong though). We also modified the recvTimeout since onsite engineers
> suspected that would help as well. The final fix we are working on is
> starting the MPD with the batch system and not relying on ssh - the
> customer does not want a root MPD ring and wants one per job so the
> batch system will do this for us.
>
> Bill.
>
>
> -----Original Message-----
> From: M J Harvey [mailto:m.j.harvey at imperial.ac.uk]
> Sent: Wednesday, October 04, 2006 12:23 PM
> To: Bill Bryce
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow
> to start up, sometimes not at all
>
> Hello,
>
>> We are going through a similar experience at one of our customer
>> sites. They are trying to run Intel MPI on more than 1,000 nodes.
>> Are you experiencing problems starting the MPD ring? We noticed it
>> takes a really long time especially when the node count is large.
>> It also just doesn't work sometimes.
>
> I've had similar problems with slow and unreliable startup of the
> Intel mpd ring. I noticed that before spawning the individual mpds, it
> connects to each node and checks the version of the installed python
> (function getversionpython() in mpdboot.py). On my cluster, at least,
> this check was very slow (not to say pointless). Removing it
> dramatically improved startup time - now it's merely slow.
>
> Also, for jobs with large process counts, it's worth increasing
> recvTimeout in mpirun from 20 seconds. This value governs the amount
> of time mpirun waits for the secondary mpi processes to be spawned by
> the remote mpds and the default value is much too aggressive for large
> jobs started via ssh.
>
> Kind Regards,
>
> Matt
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
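
The two scaling effects described in the quoted message can be illustrated with a minimal Python sketch. It is not taken from mpdboot.py or mpirun: `check_node`, `check_serial`, `check_concurrent`, and `scaled_timeout` are hypothetical names, the per-node check is simulated with a sleep rather than a real ssh connection, and the `per_proc` knob is an invented policy. Only the 20-second base value comes from the thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_node(host, latency=0.01):
    """Hypothetical stand-in for one per-node remote check (e.g. an ssh round trip)."""
    time.sleep(latency)  # simulated network and login delay
    return host, True


def check_serial(hosts, latency=0.01):
    # One node at a time: total time grows linearly with the node count.
    return [check_node(h, latency) for h in hosts]


def check_concurrent(hosts, latency=0.01, workers=32):
    # Many nodes at once: total time is roughly len(hosts) / workers round trips.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: check_node(h, latency), hosts))


def scaled_timeout(nprocs, base=20.0, per_proc=0.05):
    """Hypothetical policy: grow a receive timeout with job size.

    base=20.0 mirrors the 20-second default mentioned in the thread;
    per_proc is an invented tuning knob, not an mpirun option.
    """
    return base + per_proc * nprocs


if __name__ == "__main__":
    hosts = ["node%04d" % i for i in range(64)]
    for fn in (check_serial, check_concurrent):
        start = time.perf_counter()
        fn(hosts)
        print(fn.__name__, round(time.perf_counter() - start, 2), "s")
    print("timeout for 1000 procs:", scaled_timeout(1000), "s")
```

With a real ssh round trip of a second or more per node, the serial loop alone would account for many minutes on a 1,000-node cluster, which is consistent with the slow startup reported above; the concurrent variant only shows that the cost need not be linear, not how any MPD release actually behaves.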
