p4_error: net_recv read: probable EOF on socket: 1
rross at mcs.anl.gov
Mon Jan 28 21:20:34 PST 2002
Memory leaks and bad pointer manipulation have a way of "just working" a
lot of the time, so it wouldn't surprise me one bit to have some code that
works on a bunch of platforms but not on some other specific one. As a
developer on a couple of fairly large projects, I am sometimes amazed by
the bugs that we find literally YEARS into a project, bugs that I would
have thought would have cropped up long ago.
Electric Fence is a good, free, often already installed tool for helping
make memory-related problems show themselves (see "man efence").
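
To make the "just working" point concrete, here is a minimal sketch in C.
The names fill_and_sum and run_case are made up for illustration and have
nothing to do with the code under discussion. With an ordinary malloc, a
one-element overrun like run_case(8, 9, &s) will often run to completion
without any visible symptom, which is exactly why such bugs can hide for
years.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: writes buf[i] = 0.5 * i for i < count and returns
 * the sum. Nothing here checks count against the size of the allocation. */
static double fill_and_sum(double *buf, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        buf[i] = 0.5 * (double)i;
        sum += buf[i];
    }
    return sum;
}

/* Allocates `alloc` doubles and touches `touch` of them.
 * touch <= alloc is correct usage.
 * touch >  alloc writes past the end of the heap block, which is
 * undefined behavior and may or may not crash on a given platform.
 * Returns 0 on success, -1 if malloc fails. */
int run_case(size_t alloc, size_t touch, double *sum_out)
{
    assert(alloc > 0);
    double *buf = malloc(alloc * sizeof *buf);
    if (buf == NULL)
        return -1;
    *sum_out = fill_and_sum(buf, touch);
    free(buf);
    return 0;
}
```

If you call the overrunning case from a small main() and link against
Electric Fence (typically by adding -lefence to the link line), I would
expect the out-of-bounds write to fault right at the offending store, so a
debugger points at the bad line. The exact behavior depends on the Electric
Fence settings described in the man page.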
There have also been two subsequent releases of MPICH, so you could try
upgrading to MPICH 1.2.3 as well.
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
On Mon, 28 Jan 2002, Dr. David F. Robinson wrote:
> Several people have responded saying the most likely problem is a memory
> leak in my code. The only difficulty I have with that theory is that
> the code doesn't have any problems on a variety of alternate platforms
> including Cray T90, T3E, SP2, SGI Origin, and even on a number of
> different Linux Beowulf clusters. It is currently running on several
> large Linux clusters set up by VALinux as well as a 512-processor system
> developed by IBM (Maui supercomputing center).
> Because of the large number of users and platforms, I am thinking that
> it's a problem with the Scyld setup on my cluster. I have talked to
> several other groups running this software, and the only consistent
> difference is the Scyld software. Has anyone else run into problems
> running with the Scyld software that isn't duplicated on other
> Thanks to all who have responded.... And if anyone else has any further
> suggestions, I'm definitely interested....