[Beowulf] Weird problem with mpp-dyna
Peter St. John
peter.st.john at gmail.com
Wed Mar 14 10:12:56 PDT 2007
I think I"d just grep for
<non-numeric>13<non-numeric>
in all the scripts and confiuration files. Of course with my luck it would
be 013 :-)
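Something along these lines, maybe -- the regex and the paths are just a
sketch, so point them at wherever your cluster actually keeps its host lists
and MPI config:

  grep -rnE '(^|[^0-9])13([^0-9]|$)' /etc/lam ~/lamhosts ~/machines.* 2>/dev/null

(The character classes keep it from matching 113, 130, and the like.)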
Peter
On 3/14/07, Michael Will <mwill at penguincomputing.com> wrote:
>
> You mentioned your own code does not exhibit the issue but mpp-dyna
> does.
>
> What does the support team from the software vendor think the problem
> could be?
>
> Do you use a statically linked binary or did you relink it with your
> mpich?
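>
> A quick way to check, in case it helps (the path below is just an example,
> point it at the actual mpp-dyna binary):
>
>   file /path/to/mpp971               # reports "statically linked" or "dynamically linked"
>   ldd /path/to/mpp971 | grep -i mpi  # if dynamic, shows which MPI library it picks up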
>
> We have run LSTC LS-DYNA mpp970 and mpp971 across more than 16 nodes
> without any issues on Scyld CW4, which is also CentOS 4 based.
>
> Michael
>
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
> On Behalf Of Robert G. Brown
> Sent: Wednesday, March 14, 2007 9:37 AM
> To: Peter St. John
> Cc: Joshua Baker-LePain; beowulf at beowulf.org
> Subject: Re: [Beowulf] Weird problem with mpp-dyna
>
> On Wed, 14 Mar 2007, Peter St. John wrote:
>
> > I just want to mention (not being a sysadmin professionally, at all)
> > that you could get exactly this result if something were assigning IP
> > addresses sequentially, e.g.
> > node1 = foo.bar.1
> > node2 = foo.bar.2
> > ...
> > and something else had already assigned 13 to a public thing, say, a
> > webserver that is not open on the port that MPI uses.
> > I don't know anything about addressing a CPU within a multiprocessor
> > machine, but if it has its own IP address then it could choke this
> > way.
>
> On the same note, I'm always fond of looking for loose wires or bad
> switches or dying hardware on a bizarrely inconsistent network
> connection. Does this only happen in MPI? Or can you get oddities
> using a network testing program, e.g. NetPIPE (which will let you test
> raw sockets, MPI, and PVM in situ)?
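>
> For instance (binary names here assume a stock NetPIPE build, so treat
> this as a sketch rather than a recipe):
>
>   NPtcp                          # on node A: receiver
>   NPtcp -h <nodeA-private-ip>    # on node B: stream to node A over raw TCP
>   mpirun -np 2 NPmpi             # same pair of nodes, through the MPI layer
>
> and compare the raw TCP numbers with the MPI ones.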
>
> rgb
>
> >
> > Peter
> >
> >
> > On 3/14/07, Joshua Baker-LePain <jlb17 at duke.edu> wrote:
> >>
> >> I have a user trying to run a coupled structural thermal analysis
> >> using mpp-dyna (mpp971_d_7600.2.398). The underlying OS is CentOS 4
> >> on x86_64 hardware. We use our cluster largely as a COW, so all the
> >> cluster nodes have both public and private network interfaces. All
> >> MPI traffic is passed on the private network.
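> >> (For reference, the lamboot boot schema just lists the private-interface
> >> names, something like the following -- hostnames and cpu counts here are
> >> placeholders:
> >>
> >>   node01-priv cpu=2
> >>   node02-priv cpu=2
> >>   node03-priv cpu=2
> >>
> >> and so on.)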
> >>
> >> Running a simulation via 'mpirun -np 12' works just fine. Running
> >> the same sim (on the same virtual machine, even, i.e. in the same
> >> 'lamboot' session) with -np > 12 leads to the following output:
> >>
> >>  Performing Decomposition -- Phase 3                03/12/2007 11:47:53
> >>
> >>  *** Error the number of solid elements 13881 defined on the thermal
> >>      generation control card is greater than the total number of solids
> >>      in the model 12984
> >>
> >>  *** Error the number of solid elements 13929 defined on the thermal
> >>      generation control card is greater than the total number of solids
> >>      in the model 12985
> >>
> >> connect to address $ADDRESS: Connection timed out
> >> connect to address $ADDRESS: Connection timed out
> >>
> >> where $ADDRESS is the IP address of the *public* interface of the
> >> node on which the job was launched. Has anybody seen anything like
> >> this? Any ideas on why it would fail over a specific number of CPUs?
> >>
> >> Note that the failure is CPU dependent, not node-count dependent.
> >> I've tried on clusters made of both dual-CPU machines and quad-CPU
> >> machines, and in both cases it took 13 CPUs to create the failure.
> >> Note also that I *do* have a user writing his own MPI code, and he
> >> has no issues running on >12 CPUs.
> >>
> >> Thanks.
> >>
> >> --
> >> Joshua Baker-LePain
> >> Department of Biomedical Engineering
> >> Duke University
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org
> >> To change your subscription (digest mode or unsubscribe) visit
> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >>
> >
>
> --
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb at phy.duke.edu
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>