[Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all

Clements, Brent M (SAIC) brent.clements at bp.com
Thu Sep 28 23:43:50 PDT 2006

A buddy of mine has a cluster with over 1000 (2000) nodes.
I've compiled a simple helloworld app to test it out.
I am using Intel MPI 2.0 and running over Ethernet, so I'm trying both the ssm (since the nodes are SMP machines) and sock devices.
I'm doing the following: mpdboot -n 1500 --rsh=ssh
I do an mpdtrace and all of the nodes in my mpd.hosts file are there.
I do an mpiexec -np 1500 ./helloworld and I get a newline.
15-20 minutes go by and nothing happens. It looks like something is timing out.
If I run the program on anything below 128 processors, it works.
Does anyone have any experience running Intel MPI on over 1000 nodes, and do you have any tips to speed up task execution? Any tips to solve this issue?
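
One way to narrow down where startup breaks between 128 and 1500 processes is to bisect on the -np value. Below is a rough Python sketch of that idea, not a tested recipe for this cluster. It assumes the failure is monotonic (every count below some threshold works, every count above it hangs), that a hang can be treated as "no clean exit within a timeout", and that the mpd ring is already booted. The function names and the 300 second timeout are made up for illustration.

```python
import subprocess

def run_ok(nprocs, timeout_s=300, cmd=("mpiexec", "-np", "{n}", "./helloworld")):
    """Return True if the job exits with status 0 within timeout_s seconds.

    A hang (timeout) or a launcher that cannot be started counts as failure.
    """
    argv = [part.format(n=nprocs) for part in cmd]
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

def largest_working(lo, hi, works=run_ok):
    """Bisect for the largest process count that still works.

    Assumes works(lo) is True, works(hi) is False, and that the
    failure is monotonic in the process count.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid   # mid still starts up, search higher
        else:
            hi = mid   # mid hangs or fails, search lower
    return lo

# Example call (hypothetical bounds: 127 is known good, 1500 is known bad):
# print(largest_working(127, 1500))
```

With those bounds this takes about 11 runs. On a timeout only the local mpiexec process is killed, so leftover remote processes may need to be cleaned up by hand between runs.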
