[Beowulf] Weird problem with mpp-dyna

Mark McCardell mam3wn at virginia.edu
Wed Mar 14 08:23:50 PDT 2007


Joshua,

I'm willing to bet this is an issue with the mpp solver, and good luck trying
to get any help from LSTC.  We have run into a lot of quirks with the mpp
solver ourselves.  Most of our work is FE models.

Mark McCardell
Computer Systems Engineer
Center for Applied Biomechanics - University of Virginia
www.CenterForAppliedBiomechanics.org

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
Behalf Of Joshua Baker-LePain
Sent: Wednesday, March 14, 2007 9:33 AM
To: beowulf at beowulf.org
Subject: [Beowulf] Weird problem with mpp-dyna

I have a user trying to run a coupled structural-thermal analysis using 
mpp-dyna (mpp971_d_7600.2.398).  The underlying OS is centos-4 on x86_64 
hardware.  We use our cluster largely as a COW, so all the cluster nodes 
have both public and private network interfaces.  All MPI traffic is 
passed on the private network.
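
For anyone wondering how the private network comes into it: the LAM universe
is booted from a hostfile of private-interface names, roughly like this
(the hostnames, CPU counts, and input file below are placeholders, not our
exact setup):

   # boot schema for lamboot: private-interface names only (placeholders)
   node01-priv cpu=2
   node02-priv cpu=2
   node03-priv cpu=2

   $ lamboot -v hostfile.lam
   $ mpirun -np 12 mpp971_d_7600.2.398 i=input.k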

Running a simulation via 'mpirun -np 12' works just fine.  Running the 
same sim (in the same LAM virtual machine, i.e. the same 'lamboot' 
session) with -np > 12 produces the following output:

Performing Decomposition -- Phase 3    03/12/2007 11:47:53


*** Error the number of solid elements 13881
defined on the thermal generation control
card is greater than the total number
of solids in the model 12984

*** Error the number of solid elements 13929
defined on the thermal generation control
card is greater than the total number
of solids in the model 12985
connect to address $ADDRESS: Connection timed out
connect to address $ADDRESS: Connection timed out

where $ADDRESS is the IP address of the *public* interface of the node on 
which the job was launched.  Has anybody seen anything like this?  Any 
ideas on why it would fail above a specific number of CPUs?
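
For anyone trying to reproduce this, here is roughly what I'd check to see
which address the launch node's name resolves to and which addresses LAM
actually booted (just a sketch; output will obviously differ per site):

   $ hostname                     # the name mpp-dyna presumably looks up
   $ getent hosts `hostname`      # does it resolve to the public or private IP?
   $ grep `hostname` /etc/hosts   # any local override
   $ lamnodes                     # nodes in the current LAM universe

If the launch node's name resolves to its public address, that would at
least explain where the timed-out connections are pointed, though not why
it only shows up above 12 CPUs.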

Note that the failure is CPU-count dependent, not node-count dependent:
I've tried clusters made of both dual-CPU machines and quad-CPU machines,
and in both cases it took 13 CPUs to trigger the failure.
Note also that I *do* have a user writing his own MPI code, and he has no 
issues running on >12 CPUs.
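
For completeness, a quick sanity check in the same 'lamboot' session would
be something like the following, where mpi_hello stands in for any trivial
MPI hello-world (not something from our setup):

   $ mpirun -np 13 ./mpi_hello
   # one line per rank expected; if this runs cleanly at 13, the MPI layer
   # itself looks fine and the problem is specific to mpp-dyna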

Thanks.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University




