[Beowulf] Timeout in making connection to remote process...
jorgegg at sas.upenn.edu
Fri Jun 17 14:03:45 PDT 2005
Hi,

I'm running a Fortran 90 code on a Linux cluster with 7 nodes (I actually use only 6) using the MPI library. I can change the "size" of the program, meaning the number of operations to be performed, although all the operations are the same.

The problem is that when I try to run the program with mpirun, sometimes (most of the time, but not always) the program won't start and I get the following message (the name of the cluster is max, and it's not always node number 2):

p0_20621: p4_error: Timeout in making connection to remote process on maxsl2-d: 0
bm_list_20622: p4_error: interrupt SIGINT: 2

Other times it runs fine, even with the same number of operations! It isn't the number of people using the cluster, because most of the time it's only me. The problem also sometimes arises after the program has been running for 3 or 4 hours.

Do you have any idea why this happens? I estimate that with this number of nodes my code will take around 3 weeks to finish, so I really need to be able to rely on the computers staying in communication.

Thank you very much, and please let me know if I didn't explain myself clearly.

Jorge
