<div>I think I"d just grep for </div>
<div><non-numeric>13<non-numeric></div>
<div>in all the scripts and confiuration files. Of course with my luck it would be 013 :-)</div>
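Something along these lines, say (the paths here are just examples; point it at wherever your node definitions and launch scripts actually live):

  # match a standalone 13 or 013, not one embedded in a larger number
  grep -rnE '(^|[^0-9])0?13([^0-9]|$)' /etc/hosts /etc/dhcpd.conf /path/to/cluster/scripts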
Peter
<div><span class="gmail_quote">On 3/14/07, <b class="gmail_sendername">Michael Will</b> <<a href="mailto:mwill@penguincomputing.com">mwill@penguincomputing.com</a>> wrote:</span>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">You mentioned your own code does not exhibit the issue but mpp-dyna<br>does.<br><br>What does the support team from the software vendor think the problem
<br>could be?<br><br>Do you use a statically linked binary or did you relink it with your<br>mpich?<br><br>We have ran lstc ls-dyna mpp970 and mpp971 across more than 16 nodes<br>without any<br>issues on Scyld CW4 which is also centos 4 based.
<br><br>Michael<br><br>-----Original Message-----<br>From: <a href="mailto:beowulf-bounces@beowulf.org">beowulf-bounces@beowulf.org</a> [mailto:<a href="mailto:beowulf-bounces@beowulf.org">beowulf-bounces@beowulf.org</a>]
<br>On Behalf Of Robert G. Brown<br>Sent: Wednesday, March 14, 2007 9:37 AM<br>To: Peter St. John<br>Cc: Joshua Baker-LePain; <a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a><br>Subject: Re: [Beowulf] Weird problem with mpp-dyna
<br><br>On Wed, 14 Mar 2007, Peter St. John wrote:<br><br>> I just want to mention (not being a sysadmin professionally, at all)<br>> that you could get exactly this result if something were assigning IP<br>> addresses sequentially,
e.g.<br>> node1 = foo.bar.1<br>> node2 = foo.bar.2<br>> ...<br>> and something else had already assigned 13 to a public thing, say, a<br>> webserver that is not open on the port that MPI uses.<br>> I don't know nada about addressing a CPU within a multiprocessor
<br>> machine, but if it has it's own IP address then it could choke this<br>way.<br><br>On the same note, I'm always fond of looking for loose wires or bad<br>switches or dying hardware on a bizarrely inconsistent network
<br>connection. Does this only happen in MPI? Or can you get oddities<br>using a network testing program e.g. netpipe (which will let you test<br>raw sockets, mpi, pvm in situ)?<br><br> rgb<br><br>><br>> Peter<br>
><br>><br>> On 3/14/07, Joshua Baker-LePain <<a href="mailto:jlb17@duke.edu">jlb17@duke.edu</a>> wrote:<br>>><br>>> I have a user trying to run a coupled structural thermal analsis<br>>> using mpp-dyna (mpp971_d_7600.2.398). The underlying OS is centos-4
<br>>> on x86_64 hardware. We use our cluster largely as a COW, so all the<br>>> cluster nodes have both public and private network interfaces. All<br>>> MPI traffic is passed on the private network.<br>
>><br>>> Running a simulation via 'mpirun -np 12' works just fine. Running<br>>> the same sim (on the same virtual machine, even, i.e. in the same<br>'lamboot'<br>>> session) with -np > 12 leads to the following output:
<br>>><br>>> Performing Decomposition -- Phase 3 03/12/2007<br>>> 11:47:53<br>>><br>>><br>>> *** Error the number of solid elements 13881 defined on the thermal<br>>> generation control card is greater than the total number of solids in
<br><br>>> the model 12984<br>>><br>>> *** Error the number of solid elements 13929 defined on the thermal<br>>> generation control card is greater than the total number of solids in<br><br>>> the model 12985 connect to address $ADDRESS: Connection timed out
<br>>> connect to address $ADDRESS: Connection timed out<br>>><br>>> where $ADDRESS is the IP address of the *public* interface of the<br>>> node on which the job was launched. Has anybody seen anything like
<br>>> this? Any ideas on why it would fail over a specific number of CPUs?<br>>><br>>> Note that the failure is CPU dependent, not node-count dependent.<br>>> I've tried on clusters made of both dual-CPU machines and quad-CPU
<br>>> machines, and in both cases it took 13 CPUs to create the failure.<br>>> Note also that I *do* have a user writing his own MPI code, and he<br>>> has no issues running on >12 CPUs.<br>>><br>
>> Thanks.<br>>><br>>> --<br>>> Joshua Baker-LePain<br>>> Department of Biomedical Engineering<br>>> Duke University<br>>> _______________________________________________<br>>> Beowulf mailing list,
<a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> To change your subscription<br><br>>> (digest mode or unsubscribe) visit<br>>> <a href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf
</a><br>>><br>><br><br>--<br>Robert G. Brown <a href="http://www.phy.duke.edu/~rgb/">http://www.phy.duke.edu/~rgb/</a><br>Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305
<br>Phone: 1-919-660-2567 Fax: 919-660-2525 <a href="mailto:email:rgb@phy.duke.edu">email:rgb@phy.duke.edu</a><br><br><br>_______________________________________________<br>Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">
Beowulf@beowulf.org</a> To change your subscription<br>(digest mode or unsubscribe) visit<br><a href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</a><br></blockquote></div>
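Re rgb's netpipe suggestion: a raw-TCP check between the head node and whichever node the 13th CPU lands on would look roughly like this (the hostname is a placeholder, and I'm assuming the NetPIPE binaries are built as NPtcp/NPmpi):

  # see which node/CPU the LAM session maps the 13th slot to
  lamnodes

  # on the suspect node: start the NetPIPE TCP receiver
  NPtcp

  # on the head node: transmit to it and watch for stalls or timeouts
  NPtcp -h node13

  # then the same pair through MPI for comparison
  mpirun -np 2 NPmpi

If the raw-TCP run times out too, suspect cabling, switches, or addressing; if only the MPI run fails, suspect the application or the MPI layer.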