[Beowulf] Weird problem with mpp-dyna

Joshua Baker-LePain jlb17 at duke.edu
Wed Mar 14 11:14:31 PDT 2007


On Wed, 14 Mar 2007 at 10:05am, Michael Will wrote

> You mentioned your own code does not exhibit the issue but mpp-dyna
> does.

Yep.

> What does the support team from the software vendor think the problem
> could be?

They say that we have an academic license which does not entitle us to any 
support, but they'll look at the issue if/when they have some spare time. 
Which, really, is fair, given what we pay for it.

> Do you use a statically linked binary or did you relink it with your
> mpich?

Agh.  I forgot to mention this little wrinkle.  LSTC software distribution
is... interesting.  For mpp-dyna, they ship dynamically linked binaries
compiled against a specific version of LAM/MPI (7.0.3 in this case).
They also provide the matching pre-compiled LAM/MPI libraries on their
site.  To complicate matters further, RHEL/CentOS ships LAM/MPI 7.0.6.
However, the spec file in the RHEL/CentOS RPM does *not* include the
--enable-shared flag.  IOW, the OS vendor's LAM/MPI package has no .so
files.
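
Since the binaries are dynamically linked, it is worth confirming which
LAM/MPI libraries the loader actually picks up on each node. Below is a
minimal sketch that parses `ldd` output and reports where the MPI-related
libraries resolve. The library-name prefixes and the sample output
(including its paths) are assumptions made up for illustration; they are
not taken from a real mpp-dyna install.

```python
import re

# Matches ldd lines such as
#   "libmpi.so.0 => /usr/lib/libmpi.so.0 (0x...)"  and  "liblam.so.0 => not found".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(not found|/\S*)")


def parse_ldd(output):
    """Map each shared-library name in ldd output to its resolved path.

    Libraries the loader could not find map to None.
    """
    libs = {}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            name, target = match.groups()
            libs[name] = None if target == "not found" else target
    return libs


def mpi_libs(libs, prefixes=("liblam", "libmpi")):
    """Keep only entries whose names look MPI-related (prefixes are an assumption)."""
    return {name: path for name, path in libs.items() if name.startswith(prefixes)}


# Hypothetical ldd output, for illustration only.
SAMPLE = """\
\tlibmpi.so.0 => /opt/lam-7.0.3/lib/libmpi.so.0 (0x00002aaaaaaab000)
\tliblam.so.0 => not found
\tlibc.so.6 => /lib64/libc.so.6 (0x00002aaaaaccd000)
"""

if __name__ == "__main__":
    for name, path in mpi_libs(parse_ldd(SAMPLE)).items():
        print(name, "->", path or "NOT FOUND")
```

To use it on a real binary, feed it the text from
`subprocess.run(["ldd", path_to_binary], capture_output=True, text=True).stdout`
and compare the result across nodes.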

It seems like it'd be worth re-compiling the CentOS LAM RPM to include the
shared libraries and running against those to see if it helps.

> We have run lstc ls-dyna mpp970 and mpp971 across more than 16 nodes
> without any issues on Scyld CW4 which is also centos 4 based.

We can run straight structural sims across as many nodes/CPUs as we've
tried, and ditto for straight thermal sims.  It's just on coupled
structural/thermal sims that this issue crops up.  That, to me, rather
points to a bug in dyna itself.  But the fact that the bug manifests
itself (at least in part) by the MPI job trying to talk to a different
network interface than the one that was 'lamboot'ed is what is throwing
me off a bit.
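
One way to narrow down the interface question is to check what address
each hostname in the boot schema resolves to on every node, and compare
that with the network the job is supposed to use. The sketch below assumes
a boot schema with one host per line, optional key=value fields such as
cpu=2 after the hostname, and `#` comments. The hostnames in the sample
are hypothetical.

```python
import socket


def boot_schema_hosts(text):
    """Return the hostnames listed in a boot schema, ignoring comments and options."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            hosts.append(line.split()[0])  # first field is the hostname
    return hosts


def resolve_all(hosts, resolver=socket.gethostbyname):
    """Map each hostname to the address it resolves to, or None on failure."""
    addresses = {}
    for host in hosts:
        try:
            addresses[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            addresses[host] = None
    return addresses


# Hypothetical boot schema, for illustration only.
SAMPLE_SCHEMA = """\
# nodes on the private gigabit network
node01-gige cpu=2
node02-gige cpu=2   # second node
"""

if __name__ == "__main__":
    for host, addr in resolve_all(boot_schema_hosts(SAMPLE_SCHEMA)).items():
        print(host, "->", addr or "unresolved")
```

If a name resolves to an address on a different subnet than expected on
some node, that would at least be consistent with traffic going out the
wrong interface, though it would not explain why only coupled sims are
affected.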

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


