p4_error: net_recv read: probable EOF on socket: 1

Zickuhr, Tom tzickuhr at cessna.textron.com
Tue Jan 29 05:52:26 PST 2002

I'm no expert at this, but I would like to point out that this problem has been reported on this list even when
running the MPICH test cases, cpi and srtest.  Hopefully, those are simple enough to be free of memory
leaks.

We also have this problem with our cluster of dual Alpha UP2000s.  We found that if we run two jobs, each using
one CPU from each node, we don't get errors and we can keep all of the CPUs busy.  I don't know whether this
would be consistent with a memory leak.
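For anyone wanting to try the same split: with MPICH's ch_p4 device it can be expressed through a machines file. A minimal sketch, assuming four hypothetical node names n0-n3 (our actual node names and launcher flags may differ, and Scyld's mpirun handles placement differently):

```shell
# Hypothetical MPICH (ch_p4) machines file: each node listed once, so a
# 4-process job uses one CPU on each dual-CPU node.
cat > machines.one-per-node <<'EOF'
n0
n1
n2
n3
EOF

# Start two such jobs; together they keep both CPUs of every node busy
# without putting two processes of the same job on one node.
mpirun -np 4 -machinefile machines.one-per-node ./job1 &
mpirun -np 4 -machinefile machines.one-per-node ./job2 &
```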

I will also admit I haven't tested MPICH 1.2.3 yet.  That's the problem with work-arounds: you stop working on
the problem.

Thanks for the advice about ElectricFence.



Robert Ross wrote:

> Hi,
>
> Memory leaks and bad pointer manipulation have a way of "just working" a
> lot of the time, so it wouldn't surprise me one bit to have some code that
> works on a bunch of platforms but not on some other specific one.  As a
> developer on a couple of fairly large projects, it sometimes amazes me
> the bugs that we find literally YEARS into a project that I would have
> thought would have cropped up long ago.
>
> ElectricFence is a good, free, often already installed tool for helping
> make memory-related problems show themselves (see "man efence").
>
> There have also been two subsequent releases of MPICH, so you could try
> upgrading to MPICH 1.2.3 as well.
>
> Regards,
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
> On Mon, 28 Jan 2002, Dr. David F. Robinson wrote:
> > Several people have responded saying the most likely problem is a memory
> > leak in my code.  The only difficulty I have with that theory is that
> > the code doesn't have any problems on a variety of alternate platforms
> > including Cray T90, T3E, SP2, SGI Origin, and even on a number of
> > different Linux Beowulf clusters.  It is currently running on several
> > large Linux clusters set up by VALinux as well as a 512-processor system
> > developed by IBM (Maui supercomputing center).
> >
> > Because of the large number of users and platforms, I am thinking that
> > it's a problem with the Scyld setup on my cluster.  I have talked to
> > several other groups running this software, and the only consistent
> > difference is the Scyld software.  Has anyone else run into problems
> > running the Scyld software that aren't duplicated on other platforms?
> >
> > Thanks to all who have responded.... And if anyone else has any further
> > suggestions, I'm definitely interested....
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf