[Beowulf] Weird problem with mpp-dyna

Peter St. John peter.st.john at gmail.com
Wed Mar 14 09:12:40 PDT 2007


I just want to mention (not being a sysadmin professionally, at all) that
you could get exactly this result if something were assigning IP addresses
sequentially, e.g.
node1 = foo.bar.1
node2 = foo.bar.2
...
and something else had already assigned .13 to a public host, say, a
webserver that doesn't listen on the port MPI uses.
I know nada about addressing a CPU within a multiprocessor machine, but if
each CPU gets its own IP address, it could choke this way.
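
A quick way to test that theory: walk the node names, resolve each one,
and try a TCP connect. This is only a rough sketch -- the node1..node16
naming and the port number are guesses on my part, so plug in whatever
your cluster and MPI stack actually use.

#!/usr/bin/env python
# Rough sketch: resolve each node name and attempt a TCP connect on a
# candidate port.  node1..node16 and PORT are assumptions -- substitute
# your cluster's real naming scheme and the port your MPI layer uses.
import socket

PORT = 32768      # hypothetical port; check what your MPI daemons use
TIMEOUT = 3       # seconds before declaring an address unreachable

for n in range(1, 17):
    host = "node%d" % n
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        print("%-8s does not resolve" % host)
        continue
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(TIMEOUT)
    try:
        s.connect((addr, PORT))
        print("%-8s %-15s connect OK" % (host, addr))
    except socket.error as e:
        # one lone timeout while the rest connect would fit the theory
        print("%-8s %-15s FAILED: %s" % (host, addr, e))
    finally:
        s.close()

If exactly one address (the 13th, say) times out while the rest answer,
that would point at a collision with something public.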

Peter


On 3/14/07, Joshua Baker-LePain <jlb17 at duke.edu> wrote:
>
> I have a user trying to run a coupled structural-thermal analysis using
> mpp-dyna (mpp971_d_7600.2.398).  The underlying OS is centos-4 on x86_64
> hardware.  We use our cluster largely as a COW, so all the cluster nodes
> have both public and private network interfaces.  All MPI traffic is
> passed on the private network.
>
> Running a simulation via 'mpirun -np 12' works just fine.  Running the
> same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
> session) with -np > 12 leads to the following output:
>
> Performing Decomposition -- Phase 3            03/12/2007 11:47:53
>
>
> *** Error the number of solid elements 13881
> defined on the thermal generation control
> card is greater than the total number
> of solids in the model 12984
>
> *** Error the number of solid elements 13929
> defined on the thermal generation control
> card is greater than the total number
> of solids in the model 12985
> connect to address $ADDRESS: Connection timed out
> connect to address $ADDRESS: Connection timed out
>
> where $ADDRESS is the IP address of the *public* interface of the node on
> which the job was launched.  Has anybody seen anything like this?  Any
> ideas on why it would fail over a specific number of CPUs?
>
> Note that the failure is CPU-count dependent, not node-count dependent.
> I've tried on clusters made of both dual-CPU machines and quad-CPU
> machines, and in both cases it took 13 CPUs to create the failure.
> Note also that I *do* have a user writing his own MPI code, and he has no
> issues running on >12 CPUs.
>
> Thanks.
>
> --
> Joshua Baker-LePain
> Department of Biomedical Engineering
> Duke University