[Beowulf] Weird problem with mpp-dyna

Joe Landman landman at scalableinformatics.com
Wed Mar 14 11:51:29 PDT 2007

Joshua Baker-LePain wrote:

> Running a simulation via 'mpirun -np 12' works just fine.  Running the 
> same sim (on the same virtual machine, even, i.e. in the same 'lamboot' 
> session) with -np > 12 leads to the following output:


> *** Error the number of solid elements 13929
> defined on the thermal generation control
> card is greater than the total number
> of solids in the model 12985
> connect to address $ADDRESS: Connection timed out
> connect to address $ADDRESS: Connection timed out

When you set up that VM via LAM, you did a lamboot ....  Could you send 
the output of

	tping -c 3 N

for the larger VM?  Also, what does your machine file look like, and 
could you share what

	lamboot -d machinefile

returns for N>12?  Note that this produces a lot of output, so you might 
want to send it off-list.

> where $ADDRESS is the IP address of the *public* interface of the node 
> on which the job was launched.  Has anybody seen anything like this?  

Yes, with a borked DNS server on a head node, coupled with an incorrectly 
set up queuing system.  We have seen this at a few customer sites.

> Any ideas on why it would fail over a specific number of CPUs?

It doesn't sound like it is failing on a specific number of CPUs.  It 
sounds more like something is connecting to the public address, which 
likely has iptables rules on it, and those rules are preventing that node 
from reaching back into the private space.
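
One way to test this theory is to check what the launch node's hostname 
resolves to, since a hostname that maps to the public interface would 
explain the "connect to address" timeouts.  The following is only a 
minimal sketch using Python's standard library; the function names 
classify_address and resolved_addresses are made up for illustration and 
are not part of LAM or mpp-dyna.

```python
import ipaddress
import socket


def classify_address(addr):
    """Return 'private' if addr is in a private/loopback range, else 'public'."""
    ip = ipaddress.ip_address(addr)
    return "private" if ip.is_private else "public"


def resolved_addresses(hostname):
    """Return the sorted set of IPv4 addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    host = socket.gethostname()
    try:
        for addr in resolved_addresses(host):
            print(host, addr, classify_address(addr))
    except OSError as exc:
        # Resolution failure is itself a hint that DNS or /etc/hosts is off.
        print("could not resolve", host, exc)
```

Run on each node, this shows whether the name other nodes would use 
resolves to a private (cluster-internal) or public address.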

> Note that the failure is CPU dependent, not node-count dependent.
> I've tried on clusters made of both dual-CPU machines and quad-CPU
> machines, and in both cases it took 13 CPUs to create the failure.
> Note also that I *do* have a user writing his own MPI code, and he has 
> no issues running on >12 CPUs.

What do the machine files look like?  Are they auto-generated?
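
To see which host the 13th process would land on, the machine file can 
be walked by CPU count.  This is a rough sketch under two assumptions: 
that the file uses the "hostname cpu=N" boot schema form, and that 
processes fill CPUs in file order, which may not match how your mpirun 
invocation actually places them.  The helper names and the node1..nodeN 
hostnames are hypothetical.

```python
def parse_machinefile(text):
    """Parse 'hostname cpu=N' lines; '#' starts a comment; cpu defaults to 1."""
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        cpus = 1
        for field in fields[1:]:
            if field.startswith("cpu="):
                cpus = int(field[len("cpu="):])
        hosts.append((fields[0], cpus))
    return hosts


def host_for_process(hosts, n):
    """Return the host holding the n-th process (1-based), assuming
    processes fill each host's CPUs in file order; None if n is too large."""
    seen = 0
    for name, cpus in hosts:
        seen += cpus
        if n <= seen:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical dual-CPU cluster with eight nodes.
    sample = "\n".join("node%d cpu=2" % i for i in range(1, 9))
    print(host_for_process(parse_machinefile(sample), 13))
```

If the host this reports for process 13 is the same kind of node in both 
the dual-CPU and quad-CPU clusters (for example, one whose name resolves 
to a public address), that node is worth a closer look.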

> Thanks.


