p4_error: net_recv read: probable EOF on socket: 1

Walter B. Ligon III walt at clemson.edu
Mon Jan 28 08:32:54 PST 2002


I'm not the expert on this, but as best I understand it, this is a
fairly generic MPICH error: a task tried to receive, and the process
it was trying to receive from has gone away.  More often than not,
unless you also see lots of OTHER error messages, the cause is a
program error (in other words, an error in *your* program).

There are so many ways your program could be failing that there is
no way to do a reasonable job of telling you what to look for, but
adding some kind of debugging output to your code will usually
help.  If you can isolate which task is failing, you can run that
task under gdb, let it crash there, and see exactly where and why
it crashed.
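
As a rough illustration of the kind of debugging output meant here, the
following is a minimal sketch, not anything taken from MPICH itself.  The
helper names format_trace() and trace() and the message layout are made up
for this example, and the rank argument is assumed to be whatever your
program got back from MPI_Comm_rank().  Each line is tagged with rank, host
name, and process ID, and stderr is flushed after every message so the last
line survives if the task dies immediately afterwards.

```c
/* Sketch only: rank-tagged debug output for finding the task that dies.
 * ASSUMPTIONS: format_trace() and trace() are hypothetical helper names,
 * and the message layout is arbitrary.  In a real MPI program the rank
 * argument would be the value returned by MPI_Comm_rank(). */
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* gethostname(), getpid() */

/* Build one trace line in buf.  Returns what snprintf() returns: the
 * length the full line needs, not counting the terminating NUL. */
int format_trace(char *buf, size_t len, int rank, const char *host,
                 long pid, const char *where, const char *msg)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "[rank %d host %s pid %ld] %s: %s",
                    rank, host, pid, where, msg);
}

/* Write one trace line to stderr and flush at once, so the last line
 * is not lost in a buffer if the process dies right afterwards. */
void trace(int rank, const char *where, const char *msg)
{
    char host[256];
    char line[512];

    if (gethostname(host, sizeof host) != 0)
        strcpy(host, "unknown");
    host[sizeof host - 1] = '\0';   /* gethostname() may not terminate */

    format_trace(line, sizeof line, rank, host, (long)getpid(), where, msg);
    fprintf(stderr, "%s\n", line);
    fflush(stderr);
}
```

Calling something like trace(rank, "main loop", "before receive from rank 2")
around each communication step should show which rank printed last before
the EOF error appeared, and the host name and PID tell you which process to
look at in gdb.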

Since the problem is probably in your code, there isn't a generic
solution, and that's why you haven't seen one posted.


> I am receiving the following errors while running my mpi enabled code.  
> p4_error: net_recv read:  probable EOF on socket: 1
> This error occurs after running the code for several hours using all
> processors in my cluster.  I have seen several postings similar to this
> on the web, however, I have not seen any posted solutions.  My
> configuration is as follows:
> Mpich_1.2.1 compiled w/ Portland compilers
> Scyld 27cz-8 (Red Hat Linux 6.2)
> Linux 2.2.19
> I have tried to update my eepro100 drivers by downloading and compiling
> the netdrivers.tgz file from the Scyld ftp site.  They compiled and
> installed fine using 'make' and 'make install', however, the driver on
> the slave nodes has not been updated.  When I reboot the master node and
> do a dmesg, the latest driver is being implemented on the master.  The
> slave nodes are still booting with the old driver.  How do I get the
> boot image for the slaves to use the updated modules?  Are my problems
> caused by the old eepro100 drivers?
> Any help is greatly appreciated.
> Thanks, David

Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University
