[Beowulf] Problems with a JS21 - Ah, the networking...

Mark Hahn hahn at mcmaster.ca
Sat Sep 29 10:41:29 PDT 2007


> I sniffed the network on the store nodes' interface, and I got lots of
> TCP lost fragment, previous lost fragment, ACK lost fragment and TCP
> window size full messages. The GPFS is now heavily used.

so this indicates that you have a serious ethernet problem, no?
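
one rough way to put a number on that, independent of the sniffer, is to
compare retransmitted TCP segments to sent segments using the kernel's
own counters. what follows is only a minimal sketch, assuming the nodes
run linux and expose /proc/net/snmp; the function name is made up for
illustration, and what counts as "too high" is for you to judge.

```python
def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs parsed from /proc/net/snmp-style text."""
    # /proc/net/snmp carries two "Tcp:" lines: field names, then values.
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp: header/value pair found")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(names, (int(v) for v in values)))
    sent = counters["OutSegs"]
    # Avoid dividing by zero on a node that has sent nothing yet.
    return counters["RetransSegs"] / sent if sent else 0.0
```

on a node you would feed it open("/proc/net/snmp").read(). the counters
are cumulative since boot, so sampling twice and comparing the deltas
says more about the current state than a single reading.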

> The myrinet connection was working right, but sometimes a user program
> just got stuck - one of the processes was sleeping, and all others
> were running. Then, the program hangs. Investigating this further,
> this happened with the simple mpich examples like cpi, cpilog, etc. We
> are using the mx driver version 1.1.6, and mpich-mx 1.2.7..5. mx_info
> shows all nodes connected when this happens, and the switch did not
> overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> to the nodes, but somehow one (or more) processes go to sleep or never
> start, and all the other processes just hang. The mx diagnose tools
> did not show any problem so far, but we still did not have done a

but spawning myrinet jobs normally involves some use of ethernet,
which on your cluster has known problems.  as I recall, the protocol
involves a rendezvous ethernet socket managed by the rank0 node.
couldn't the myrinet-starting problem simply be due to the eth problem,
rather than anything specific to myrinet?

here's an idea: configure ip-over-myrinet, and use it exclusively
to start the jobs.  if that works, then you know for sure that the 
problem is solely on the eth side (switch, perhaps, or maybe a nic
that's jabbering or otherwise misbehaving?)
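
before reconfiguring the job launcher, a quick comparison of the two
paths can be had by opening plain TCP connections repeatedly to the same
node over each interface. this is a sketch only: it does not reproduce
whatever mpirun.ch_mx actually does at startup, the function name is
invented, and the hostnames and port in the usage note are placeholders.

```python
import socket
import time


def probe_tcp(host, port, attempts=20, timeout=2.0):
    """Open and close a TCP connection `attempts` times.

    Returns (failures, worst_seconds): how many connects failed or timed
    out, and the longest time any single attempt took.
    """
    failures, worst = 0, 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failures += 1
        worst = max(worst, time.monotonic() - start)
    return failures, worst
```

for example, probe_tcp("node01-eth", 22) versus probe_tcp("node01-myri", 22),
where the two names are hypothetical and stand for the same node's
ethernet and ip-over-myrinet addresses, and 22 assumes sshd is listening.
failures or multi-second connects on the eth name only would point the
same way as the sniffer output.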


