p4_error: net_recv read: probable EOF on socket: 1
Dr. David F. Robinson drobinson at aletheon.com
Mon Jan 28 17:51:37 PST 2002
Several people have responded saying the most likely problem is a memory leak in my code. The only difficulty I have with that theory is that the code doesn't have any problems on a variety of alternate platforms, including Cray T90, T3E, SP2, SGI Origin, and even a number of different Linux Beowulf clusters. It is currently running on several large Linux clusters set up by VALinux as well as a 512-processor system developed by IBM (Maui supercomputing center).

Because of the large number of users and platforms, I am thinking that it's a problem with the Scyld setup on my cluster. I have talked to several other groups running this software, and the only consistent difference is the Scyld software.

Has anyone else run into problems with the Scyld software that aren't duplicated on other platforms?

Thanks to all who have responded.... And if anyone else has any further suggestions, I'm definitely interested....

David

-----Original Message-----
From: walt at splinter.parl.clemson.edu [mailto:walt at splinter.parl.clemson.edu] On Behalf Of Walter B. Ligon III
Sent: Monday, January 28, 2002 11:33 AM
To: drobinson at aletheon.com
Cc: beowulf at beowulf.org
Subject: Re: p4_error: net_recv read: probable EOF on socket: 1

--------

I'm not the expert on this, but as best I understand it this is a fairly generic MPICH error which says a task tried to receive and the process it tried to receive from has gone away. More often than not, unless you have lots of OTHER error messages, this is some kind of program error (IOW, *your* program's error).

There are so many ways your program could be failing that there is no way to do a reasonable job of telling you what to look for, but usually adding some kind of debugging output to your code will help. If you can isolate which task is failing, you can crash it in gdb and see exactly where and why it crashed.

Since this is probably your code, there isn't a generic solution, and that's why it isn't posted.

Walt

> I am receiving the following errors while running my MPI-enabled code.
>
> p4_error: net_recv read: probable EOF on socket: 1
>
> This error occurs after running the code for several hours using all
> processors in my cluster. I have seen several postings similar to this
> on the web; however, I have not seen any posted solutions. My
> configuration is as follows:
>
> Mpich_1.2.1 compiled w/ Portland compilers
> Scyld 27cz-8 (Red Hat Linux 6.2)
> Linux 2.2.19
>
> I have tried to update my eepro100 drivers by downloading and compiling
> the netdrivers.tgz file from the Scyld ftp site. They compiled and
> installed fine using 'make' and 'make install'; however, the driver on
> the slave nodes has not been updated. When I reboot the master node and
> do a dmesg, the latest driver is being implemented on the master. The
> slave nodes are still booting with the old driver. How do I get the
> boot image for the slaves to use the updated modules? Are my problems
> caused by the old eepro100 drivers?
>
> Any help is greatly appreciated.
> Thanks, David

--
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University
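
As an illustration of the "debugging output" suggestion above, and of one way to check the memory-leak theory, here is a minimal sketch in C. It is not taken from the original code: the helper names `read_vmrss_kb` and `debug_log` are hypothetical, the rank is passed in as a plain integer (in an MPICH program it would presumably come from `MPI_Comm_rank`), and the memory figure assumes a Linux node where `/proc/self/status` is readable. Each call prints one line tagged with the task and host and flushes it immediately, so the last line from a task that dies should still reach the log.

```c
#define _POSIX_C_SOURCE 200112L  /* for gethostname() under strict -std modes */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: resident set size of this process in kB,
 * parsed from the "VmRSS:" line of /proc/self/status (Linux only).
 * Returns -1 if the value cannot be read. */
long read_vmrss_kb(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            if (sscanf(line + 6, "%ld", &kb) != 1)
                kb = -1;
            break;
        }
    }
    fclose(fp);
    return kb;
}

/* Hypothetical helper: write one tagged line to stderr and flush it.
 * rank  - task number (assumed to come from MPI_Comm_rank in real code)
 * where - free-form label for the point in the program being logged
 * Returns the number of characters written, or a negative value on error. */
int debug_log(int rank, const char *where)
{
    char host[256];
    int n;

    if (gethostname(host, sizeof host) != 0)
        strcpy(host, "unknown");
    host[sizeof host - 1] = '\0';  /* gethostname may not terminate on truncation */

    n = fprintf(stderr, "[rank %d on %s] %s: VmRSS=%ld kB\n",
                rank, host, where, read_vmrss_kb());
    fflush(stderr);  /* make sure the line is out before any later crash */
    return n;
}
```

Calling something like `debug_log(rank, "before exchange")` around each communication phase would show which task stops reporting first, and a VmRSS value that keeps climbing over several hours on one node would support the leak theory, while a flat value would point elsewhere.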
