[Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all

Mark Hahn hahn at physics.mcmaster.ca
Fri Sep 29 07:34:46 PDT 2006


> We are going through a similar experience at one of our customer sites.
> They are trying to run Intel MPI on more than 1,000 nodes.  Are you
> experiencing problems starting the MPD ring?  We noticed it takes a
> really long time especially when the node count is large.  It also just
> doesn't work sometimes.

I didn't mean to imply we're using Intel MPI (in fact, we're using HP-MPI,
but it had some issues with very large numbers of fd's as well - indeed,
I think we caused them to recode from select to epoll.)

so my comment was general: MPI vendors sometimes forget about how many
fd's they're using per node.
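
for the curious, the shape of the bug looks roughly like this - a
minimal C sketch (the 1024 figure is glibc's default FD_SETSIZE;
the epoll calls are the standard Linux replacement):

    /* sketch of the classic select() wall: glibc's fd_set is a
     * fixed bitmap of FD_SETSIZE (1024) bits, and FD_SET() on a
     * descriptor >= FD_SETSIZE scribbles past the end of it -
     * undefined behavior.  one socket per node means trouble
     * right around 1024 nodes. */
    #include <stdio.h>
    #include <sys/select.h>
    #include <sys/epoll.h>

    int main(void)
    {
        printf("FD_SETSIZE = %d\n", FD_SETSIZE);  /* 1024 on glibc */

        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(1023, &rfds);      /* the last legal descriptor     */
        /* FD_SET(1024, &rfds);      undefined behavior - writes
                                     past the end of the bitmap    */

        /* epoll has no compile-time cap; it scales to whatever
         * ulimit -n allows */
        int epfd = epoll_create(1024);  /* size is only a hint */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = 0 };
        epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev);  /* watch stdin */
        return 0;
    }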

in general, though, with a modern Linux system, you should be able
to simply tweak ulimit -n.  I don't think even a sysctl is necessary
(though there may also be network-derived limits - open sockets, 
routing entries, iptables, core memory limits, etc.)
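
note that a daemon started outside your login shell may not see your
interactive ulimit; the same knob is reachable in-process.  a sketch,
assuming the hard limit itself has already been raised (that part
does need root, e.g. via /etc/security/limits.conf):

    /* raise the soft fd limit to the hard cap from inside the
     * process - equivalent to "ulimit -n <hard>" in the shell
     * that launches it. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("soft %ld, hard %ld\n",
               (long)rl.rlim_cur, (long)rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;   /* soft -> hard */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }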

it would probably be illuminating to measure exactly what the critical
number is - can you do 999 nodes, but 1000 fails?  also, you may find 
that turning off some features will reduce the consumption of fd's
or sockets (disable stdin forwarding/replication to all but rank 0?
disable stdout/err from all but rank 0?)
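
a two-minute probe, run as the same user and environment as the
daemons, will tell you where the kernel-side wall actually is:

    /* open sockets until the kernel refuses, and report the count
     * (usually EMFILE at ulimit -n, minus the 3 stdio fd's).  if
     * this probe goes well past 1024 but the MPI still falls over
     * there, suspect a select()-style limit in userspace instead. */
    #include <stdio.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        long n = 0;
        for (;;) {
            if (socket(AF_INET, SOCK_STREAM, 0) < 0) {
                printf("gave out after %ld sockets: %s\n",
                       n, strerror(errno));
                return 0;
            }
            n++;
        }
    }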

this reminds me of a peeve of mine, that eth-based MPI never takes 
advantage of the hardware's inherent broad/multicast capabilities.
yes, it's convenient to use the standard TCP stack so you can ignore
reliable delivery issues, but creating rings and consuming many 
sockets by forwarding stdio are actually great examples of the downside.
treating eth as a multicast fabric and actually doing retrans in MPI
(or a sub-layer) would solve some problems.  and I suspect it could lead
to some interesting performance advantages.
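
by way of illustration, the socket-level primitive is already there:
one datagram to a group address reaches every subscribed node, so
stdio replication and ring setup needn't cost O(N) point-to-point
sockets.  a sketch - group address and port are arbitrary picks of
mine, and the reliable-delivery layer is the part somebody would
actually have to write:

    /* sender: one datagram to the group reaches all joined nodes */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in grp;
        memset(&grp, 0, sizeof grp);
        grp.sin_family = AF_INET;
        grp.sin_addr.s_addr = inet_addr("239.1.2.3"); /* arbitrary group */
        grp.sin_port = htons(5000);                   /* arbitrary port  */

        const char msg[] = "stdout from rank 0, everyone";
        sendto(fd, msg, sizeof msg, 0,
               (struct sockaddr *)&grp, sizeof grp);

        /* each receiver just binds the port and joins the group:
         *   struct ip_mreq m;
         *   m.imr_multiaddr.s_addr = inet_addr("239.1.2.3");
         *   m.imr_interface.s_addr = htonl(INADDR_ANY);
         *   setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
         *              &m, sizeof m);
         * acks/retransmission would sit above this. */
        close(fd);
        return 0;
    }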

regards, mark.

> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
> On Behalf Of Mark Hahn
> Sent: Friday, September 29, 2006 8:47 AM
> To: Clements, Brent M (SAIC)
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow
> to start up, sometimes not at all
>
>> Does anyone have any experience running Intel MPI over 1000 nodes, and
>> do you have any tips to speed up task execution? Any tips to solve this
>> issue?
>
> it's not uncommon for someone to write naive select() code that fails
> when the number of open file descriptors hits 1024...  yes, even in
> the internals of major MPI implementations.

-- 
operator may differ from spokesperson.	            hahn at mcmaster.ca


