[Beowulf] Weird problem with mpp-dyna
Michael Will
mwill at penguincomputing.com
Wed Mar 14 10:05:01 PDT 2007
You mentioned your own code does not exhibit the issue but mpp-dyna
does.
What does the support team from the software vendor think the problem
could be?
Do you use a statically linked binary or did you relink it with your
mpich?
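For reference, a quick way to tell the two cases apart (the solver binary
name is taken from the report quoted below; the path is a placeholder):

    # A statically linked binary reports "statically linked" and ldd has
    # nothing useful to say; a dynamically linked one lists the MPI
    # libraries it was built against.
    $ file ./mpp971_d_7600.2.398
    $ ldd ./mpp971_d_7600.2.398 | grep -i mpi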
We have run LSTC LS-DYNA mpp970 and mpp971 across more than 16 nodes
without any issues on Scyld CW4, which is also CentOS 4 based.
Michael
-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
On Behalf Of Robert G. Brown
Sent: Wednesday, March 14, 2007 9:37 AM
To: Peter St. John
Cc: Joshua Baker-LePain; beowulf at beowulf.org
Subject: Re: [Beowulf] Weird problem with mpp-dyna
On Wed, 14 Mar 2007, Peter St. John wrote:
> I just want to mention (not being a sysadmin professionally, at all)
> that you could get exactly this result if something were assigning IP
> addresses sequentially, e.g.
> node1 = foo.bar.1
> node2 = foo.bar.2
> ...
> and something else had already assigned 13 to a public thing, say, a
> webserver that is not open on the port that MPI uses.
> I don't know anything about addressing a CPU within a multiprocessor
> machine, but if it has its own IP address then it could choke this way.
On the same note, I'm always fond of looking for loose wires or bad
switches or dying hardware on a bizarrely inconsistent network
connection. Does this only happen in MPI? Or can you get oddities
using a network testing program, e.g. NetPIPE (which will let you test
raw sockets, MPI, and PVM in situ)?
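A minimal sketch of that kind of check, assuming NetPIPE was built with
its usual NPtcp and NPmpi drivers and using placeholder node names:

    # Confirm what the cluster's name service says a compute node resolves to.
    $ getent hosts node13
    # Raw TCP between two suspect nodes: start the receiver first...
    [node12]$ NPtcp
    # ...then point the transmitter at it over the private interface.
    [node13]$ NPtcp -h node12
    # Repeat the same path through the MPI layer.
    $ mpirun -np 2 NPmpi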
rgb
>
> Peter
>
>
> On 3/14/07, Joshua Baker-LePain <jlb17 at duke.edu> wrote:
>>
>> I have a user trying to run a coupled structural-thermal analysis
>> using mpp-dyna (mpp971_d_7600.2.398). The underlying OS is CentOS 4
>> on x86_64 hardware. We use our cluster largely as a COW, so all the
>> cluster nodes have both public and private network interfaces. All
>> MPI traffic is passed on the private network.
>>
>> Running a simulation via 'mpirun -np 12' works just fine. Running
>> the same sim (on the same virtual machine, even, i.e. in the same
>> 'lamboot' session) with -np > 12 leads to the following output:
>>
>> Performing Decomposition -- Phase 3    03/12/2007 11:47:53
>>
>>
>> *** Error the number of solid elements 13881 defined on the thermal
>> generation control card is greater than the total number of solids in
>> the model 12984
>>
>> *** Error the number of solid elements 13929 defined on the thermal
>> generation control card is greater than the total number of solids in
>> the model 12985
>>
>> connect to address $ADDRESS: Connection timed out
>> connect to address $ADDRESS: Connection timed out
>>
>> where $ADDRESS is the IP address of the *public* interface of the
>> node on which the job was launched. Has anybody seen anything like
>> this? Any ideas on why it would fail over a specific number of CPUs?
>>
>> Note that the failure is CPU-count dependent, not node-count dependent.
>> I've tried on clusters made of both dual-CPU machines and quad-CPU
>> machines, and in both cases it took 13 CPUs to create the failure.
>> Note also that I *do* have a user writing his own MPI code, and he
>> has no issues running on >12 CPUs.
>>
>> Thanks.
>>
>> --
>> Joshua Baker-LePain
>> Department of Biomedical Engineering
>> Duke University
>
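Since the timeouts point at the public interface of the launch node once
more than 12 ranks are requested, one more thing worth ruling out is how
the LAM boot schema names the hosts. A minimal sketch, assuming placeholder
private-interface hostnames, CPU counts, and input deck (only the mpp971
binary name comes from the report above):

    # bhost.priv -- LAM boot schema listing only private-interface names
    node01-priv cpu=2
    node02-priv cpu=2
    node03-priv cpu=2
    node04-priv cpu=2
    node05-priv cpu=2
    node06-priv cpu=2
    node07-priv cpu=2

    $ lamboot -v bhost.priv    # boot the LAM universe over the private network
    $ lamnodes                 # check that every node shows the expected CPU count
    # Do the timeouts to the public address still appear past 12 ranks?
    $ mpirun -np 13 ./mpp971_d_7600.2.398 i=input.k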
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb at phy.duke.edu
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org To change your subscription
(digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf