[Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all

Clements, Brent M (SAIC) brent.clements at bp.com
Fri Sep 29 06:26:13 PDT 2006


I've solved the issue. It was a network problem. 
 
And yes, I know about the file descriptor problem.
 
Mark, my question was a bit misleading. :-) I asked a very simplistic and broad question to see whether I had missed anything before I started troubleshooting other parts of the cluster. And yes, I know about the file descriptor problem. :-)
 
Thanks for your help, buddy.
 
 

________________________________

From: Mark Hahn [mailto:hahn at physics.mcmaster.ca]
Sent: Fri 9/29/2006 7:46 AM
To: Clements, Brent M (SAIC)
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all



> Does anyone have any experience running intel mpi over 1000 nodes and do you have any tips to speed up task execution? Any tips to solve this issue?

it's not uncommon for someone to write naive select() code that fails
when the number of open file descriptors hits 1024...  yes, even in
the internals of major MPI implementations.
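
For readers unfamiliar with the limit mentioned above: select() tracks descriptors in a fixed-size fd_set bitmap of FD_SETSIZE entries (commonly 1024 on Linux), so a descriptor numbered at or above that value cannot be passed to it safely. poll() takes a list of descriptors instead and has no such ceiling. The following is only a rough sketch of the poll()-based alternative, not code from Intel MPI or mpd; the helper name wait_readable and the socketpair demo are made up for illustration, and select.poll is not available on Windows.

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets in `socks` that have data to read.

    Hypothetical helper: uses poll(), which is not bounded by
    FD_SETSIZE the way select() is.
    """
    poller = select.poll()
    by_fd = {}
    for s in socks:
        by_fd[s.fileno()] = s
        poller.register(s.fileno(), select.POLLIN)
    # poll() returns (fd, event_mask) pairs for descriptors with pending events.
    return [by_fd[fd] for fd, event in poller.poll(timeout_ms)
            if event & select.POLLIN]


# Small demo: only the receiving end of the pair should be reported.
a, b = socket.socketpair()
b.sendall(b"ping")
ready = wait_readable([a, b])
```

The same idea applies to a daemon that holds one connection per node: once the process has more than about a thousand sockets open, the descriptor numbers pass FD_SETSIZE, and only a poll()- or epoll-style interface keeps working.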





