[Beowulf] Problems with a JS21 - Ah, the networking...

Patrick Geoffray patrick at myri.com
Mon Oct 1 06:35:38 PDT 2007


Hi Ivan,

Ivan Paganini wrote:
> The myrinet connection was working right, but sometimes a user program
> just got stuck - one of the processes was sleeping, and all others
> were running. Then, the program hangs. Investigating this further,

Unless you are using blocking receives ("--mx-recv blocking" or 
"--mx-recv hybrid"), the default mode is polling. So a process will 
only sleep if it is still in the spawning phase (in MPI_Init) or if it is 
blocking on something outside MPI (like disk I/O).

> overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> to the nodes, but somehow one (or more) process go to sleep or never
> starts, and all the other processes just hangs. The mx diagnose tools

All processes wait on everybody at spawn time, so if one process never 
starts, the rest of the MPI world will wait for it, possibly forever. 
The root problem is the process not starting.
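
To illustrate the point about everybody waiting at spawn time, here is a 
minimal sketch in Python. It is only a toy model, not MPICH-MX code: 
threads stand in for MPI processes, a threading.Barrier stands in for the 
startup rendezvous, and simulate_spawn, missing and timeout are 
hypothetical names made up for the example. The timeout exists only so the 
simulation terminates; the real startup may wait indefinitely.

```python
import threading


def simulate_spawn(n_procs, missing, timeout=0.5):
    """Toy model of a startup rendezvous among n_procs ranks.

    Ranks listed in `missing` never start. Returns the sorted list of
    ranks that got past the rendezvous.
    """
    barrier = threading.Barrier(n_procs)  # needs all n_procs to arrive
    done = []
    lock = threading.Lock()

    def rank_main(rank):
        try:
            # Every started rank waits here for all the others.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # Someone never arrived; in this toy model we give up.
            return
        with lock:
            done.append(rank)

    threads = [
        threading.Thread(target=rank_main, args=(r,))
        for r in range(n_procs)
        if r not in missing
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)


if __name__ == "__main__":
    print(simulate_spawn(4, set()))  # every rank starts
    print(simulate_spawn(4, {2}))    # rank 2 never starts
```

When every rank starts, all of them get past the rendezvous. When a single 
rank is missing, none of the others do, which matches the hang described 
above.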

The spawning phase in MPICH-MX uses sockets and ssh (or rsh). Usually, 
ssh runs over native Ethernet, but it could also run over IPoM (IP over 
Myrinet). Which one is it in your case?
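
One rough way to check is to see which subnet the node hostnames resolve 
to. The sketch below assumes the Ethernet and Myrinet interfaces sit on 
separate subnets; the subnets in NETWORKS and the helper names classify 
and classify_host are made up for the example, so substitute the values 
from your own cluster.

```python
import ipaddress
import socket

# Hypothetical subnets, replace with the ones used on your cluster.
NETWORKS = {
    "ethernet": "192.168.1.0/24",
    "myrinet": "10.0.0.0/24",
}


def classify(ip, networks):
    """Return the name of the first network containing ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for name, cidr in networks.items():
        if addr in ipaddress.ip_network(cidr):
            return name
    return "unknown"


def classify_host(hostname, networks):
    """Resolve hostname to an IPv4 address and classify that address."""
    return classify(socket.gethostbyname(hostname), networks)


if __name__ == "__main__":
    # Classifies the address that this machine's own hostname resolves to.
    print(classify_host(socket.gethostname(), NETWORKS))
```

Running classify_host on the hostnames listed in the machine file shows 
which network ssh will use to reach each node, provided ssh connects to 
the same names.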

Patrick


