[Beowulf] Problems with a JS21 - Ah, the networking...

Ivan Paganini ispmarin at gmail.com
Mon Oct 1 06:05:43 PDT 2007


Just an update: over several attempts, strace stops at different
points: the one specified in the other email, and this one:
_______________________________________________
munmap(0x40176000, 4096)                = 0
time([1191243868])                      = 1191243868
open("/etc/hosts", O_RDONLY)            = 4
fcntl64(4, F_GETFD)                     = 0
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
fstat64(4, {st_mode=S_IFREG|0644, st_size=10247, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40176000
read(4, "#\n# hosts         This file desc"..., 4096) = 4096
read(4, "yriBlade077\n192.168.30.178  myri"..., 4096) = 4096
read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
read(4, "", 4096)                       = 0
close(4)                                = 0
munmap(0x40176000, 4096)                = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31382
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31383
brk(0x102ab000)                         = 0x102ab000
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31384
waitpid(-1,
_______________________________________________
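
One possible reading of the trace above: the process forks three
children (the clone() calls) and then blocks in waitpid(-1, ...) until
one of them exits. If so, the hang may be in a child rather than in the
traced parent, and following the children (for example with strace -f)
could show where they stop. The sketch below only reproduces that
fork-then-wait pattern; spawn_and_wait and the sleep delays are
hypothetical stand-ins, not the actual program being traced.

```python
import os
import time


def spawn_and_wait(delays):
    """Fork one child per delay, then block in waitpid(-1) until all are reaped."""
    children = []
    for delay in delays:
        pid = os.fork()              # on Linux/glibc this typically appears in strace as clone()
        if pid == 0:
            time.sleep(delay)        # hypothetical stand-in for the real child's work
            os._exit(0)
        children.append(pid)

    reaped = []
    while len(reaped) < len(children):
        # The parent blocks here until some child exits; a child that never
        # exits leaves the parent sitting in waitpid() indefinitely.
        pid, _status = os.waitpid(-1, 0)
        if pid in children:
            reaped.append(pid)
    return children, reaped


if __name__ == "__main__":
    children, reaped = spawn_and_wait([0.3, 0.1, 0.2])
    print("forked:", children)
    print("reaped:", reaped)
```

Running this under strace should show the same shape as the trace:
several clone() calls followed by a waitpid(-1, ...) that returns only
once a child finishes.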

Thanks.

2007/10/1, Ivan Paganini <ispmarin at gmail.com>:
> Hello Chris, everybody:
>
> I am not using jumbo frames. I'm now considering this option, but
> first I want to be sure that there is no other problem, just to
> limit the number of variables at hand. But thanks for your
> help.
>
> I ran strace on the hung process, and the output is this:
> ______________________________________________
>
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40176000
> read(4, "#\n# hosts         This file desc"..., 4096) = 4096
> read(4, "yriBlade077\n192.168.30.178  myri"..., 4096) = 4096
> read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
> read(4, "", 4096)                       = 0
> close(4)                                = 0
> munmap(0x40176000, 4096)                = 0
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25994
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25995
> brk(0x102ab000)                         = 0x102ab000
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25996
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0)              = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1,
>
> ______________________________________________
> and just that. I'm now trying to get a better understanding of what
> is happening.
>
> Thank you.
>
> Ivan
>
>
> 2007/9/29, Chris Samuel <csamuel at vpac.org>:
> > On Sat, 29 Sep 2007, Ivan Paganini wrote:
> >
> > > I sniffed the network on the store nodes' interface, and I got lots
> > > of TCP lost fragments, previous lost fragments, ACK lost fragments
> > > and TCP window size full.
> >
> > Some suggestions would be to check that all network interfaces are
> > negotiating gigabit back to the switch, and that if you are using
> > jumbo frames then all interfaces are indeed using jumbo frames.
> >
> > A useful way to verify two-way jumbo frame connectivity is the
> > ping command:
> >
> > ping -c 1 -M do -s 8900 $hostname
> >
> > should tell you whether or not it is working.
> >
> > Best of luck!
> > Chris
> > --
> > Christopher Samuel - (03) 9925 4751 - Systems Manager
> >  The Victorian Partnership for Advanced Computing
> >  P.O. Box 201, Carlton South, VIC 3053, Australia
> > VPAC is a not-for-profit Registered Research Agency
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> >
>
>
> --
> -----------------------------------------------------------
> Ivan S. P. Marin
> ----------------------------------------------------------
>
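
For reference, the arithmetic behind the ping check quoted above can be
sketched as follows. With -M do the packet may not be fragmented, and
-s gives the ICMP payload size, so the packet on the wire is the
payload plus an 8-byte ICMP header and a 20-byte IPv4 header. The
sketch assumes IPv4 without options; the 9000-byte MTU is a common
jumbo-frame setting used here purely as an assumption, since the thread
does not state the MTU, and the function names are illustrative.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
ICMP_HEADER = 8    # bytes, ICMP echo request header


def ip_packet_size(payload):
    """Total IPv4 packet size produced by `ping -s payload`."""
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(mtu):
    """Largest `-s` value that fits in one unfragmented packet for this MTU."""
    return mtu - ICMP_HEADER - IPV4_HEADER


def fits_without_fragmenting(payload, mtu):
    """True if `ping -M do -s payload` can pass a link with this MTU."""
    return ip_packet_size(payload) <= mtu


if __name__ == "__main__":
    for mtu in (1500, 9000):   # standard Ethernet vs. an assumed jumbo MTU
        print(mtu, max_ping_payload(mtu), fits_without_fragmenting(8900, mtu))
```

Under these assumptions an 8900-byte payload becomes an 8928-byte
packet, which passes only if every hop on the path carries frames that
large; on a standard 1500-byte MTU the largest unfragmented payload is
1472 bytes.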


-- 
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------


