[Beowulf] Weird problem with mpp-dyna
Joe Landman
landman at scalableinformatics.com
Wed Mar 14 11:51:29 PDT 2007
Joshua Baker-LePain wrote:
>
> Running a simulation via 'mpirun -np 12' works just fine. Running the
> same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
> session) with -np > 12 leads to the following output:
[...]
> *** Error the number of solid elements 13929
> defined on the thermal generation control
> card is greater than the total number
> of solids in the model 12985
> connect to address $ADDRESS: Connection timed out
> connect to address $ADDRESS: Connection timed out
When you set up that VM via LAM, you did a lamboot .... Could you send
the output of
tping -c 3 N
for the larger VM? Also, what does your machine file look like, and
could you share what
lamboot -d machinefile
returns for N > 12? Note that this is a fair bit of output, so you might want
to send it off-list.
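For reference, a quick diagnostic pass on the larger VM would look something
like the following (machinefile here just stands in for whatever boot schema
you actually use):

lamboot -d machinefile
lamnodes
tping -c 3 N

lamnodes lists the nodes and CPU counts LAM actually thinks are in the VM,
which is worth comparing against what you asked for.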
> where $ADDRESS is the IP address of the *public* interface of the node
> on which the job was launched. Has anybody seen anything like this?
Yes, with a borked DNS server on a head node, coupled to an incorrectly
set up queuing system. We have seen this at a few customer sites.
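One quick way to check for that, assuming the standard resolver tools are on
the nodes, is to compare name resolution from the head node and from a compute
node (hostnames below are made up):

getent hosts headnode
getent hosts node01

If the head node's name resolves to its public address from inside the
cluster, traffic that should stay on the private network ends up aimed at the
public interface.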
> Any ideas on why it would fail over a specific number of CPUs?
It doesn't sound like it is failing at a specific number of CPUs; it sounds
more like there is a public address, likely with iptables rules on it,
preventing that node from reaching back into the private address space.
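A quick sanity check on the launch node, assuming iptables is actually running
there, would be something like:

iptables -L -n -v
netstat -rn

If the private subnet isn't allowed in on the interface the other ranks are
connecting to, you get exactly the connection timeouts shown above.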
>
> Note that the failure is CPU dependent, not node-count dependent.
> I've tried on clusters made of both dual-CPU machines and quad-CPU
> machines, and in both cases it took 13 CPUs to create the failure.
> Note also that I *do* have a user writing his own MPI code, and he has
> no issues running on >12 CPUs.
What do the machine files look like? Are they auto-generated?
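For comparison, a hand-written LAM boot schema usually looks something like
this (hostnames and CPU counts here are only an example):

node01 cpu=2
node02 cpu=2
node03 cpu=2
node04 cpu=2

If a script generates that file and slips in the head node's public name, or
gets a cpu= count wrong, crossing a particular CPU count is exactly where you
would expect to see it break.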
>
> Thanks.
>
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615