[Beowulf] Problems with a JS21 - Ah, the networking...
Ivan Paganini
ispmarin at gmail.com
Sat Sep 29 12:40:26 PDT 2007
Hello Mark!
2007/9/29, Mark Hahn <hahn at mcmaster.ca>:
> > I sniffed the network on the store nodes' interface, and I got lots of
> > TCP lost fragments, previous lost fragments, ACK lost fragments and TCP
> > window size full. The GPFS is now heavily used.
>
> so this indicates that you have a serious ethernet problem, no?
I think so too, and it is the most likely explanation. But IBM does
not accept that there is an error in the hardware, and while I argue
with them about it, I have been searching for other possible causes of
the Ethernet problem.
>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs. Investigating this further,
> > this happened with the simple mpich examples like cpi, cpilog, etc. We
> > are using the mx driver version 1.1.6, and mpich-mx 1.2.7..5. mx_info
> > shows all nodes connected when this happens, and the switch did not
> > overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> > to the nodes, but somehow one (or more) processes go to sleep or never
> > start, and all the other processes just hang. The mx diagnose tools
> > did not show any problem so far, but we still have not done a
>
> but spawning myrinet jobs normally involves some use of ethernet,
> which has known problems. as I recall, the protocol involves a
> rendezvous ethernet socket managed by the rank0 node. couldn't the
> myrinet-starting problem simply be due to the eth problem, rather than
> anything specific to myrinet?
>
> here's an idea: configure ip-over-myrinet, and use it exclusively
> to start the jobs. if that works, then you know for sure that the
> problem is solely on the eth side (switch, perhaps, or maybe a nic
> that's jabbering or otherwise misbehaving?)
I have configured ip-over-myrinet, but I'm not sure how to make the
job startup use Myrinet exclusively. I will have to read more about this.
My configuration is as follows: I am using mpich-mx v 1.2.7..5, and
I configured each blade with one IP address using ifconfig, like
ifconfig myri0 192.168.30.101
Then, in a file called list, I put
192.168.30.101:4
(each blade has 4 cores), and ran it using
mpirun.ch_mx -v -machinefile list -np 4 ./program
Does this still involve Ethernet?
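For more than one blade, the list file could be generated instead of typed by hand. This is only a minimal sketch: the function name gen_machinefile is made up for illustration, and it assumes the other blades take consecutive addresses after 192.168.30.101 on the myri0 subnet, which may not match the real layout. It only builds the list; it does not by itself show whether the launcher still opens sockets over Ethernet.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpich-mx): print one "ip:slots"
# machinefile line per blade.
# Arguments: subnet prefix, first host octet, number of blades, cores per blade.
# Assumption: blades use consecutive addresses on the ip-over-myrinet subnet.
gen_machinefile() {
    prefix=$1
    first=$2
    count=$3
    slots=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "${prefix}.$((first + i)):${slots}"
        i=$((i + 1))
    done
}

# Example: two blades starting at 192.168.30.101, 4 cores each.
gen_machinefile 192.168.30 101 2 4
```

Redirecting the output to the file named list (gen_machinefile 192.168.30 101 2 4 > list) would give the same format as the single-line file above.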
Thank you very much.
--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------