[Beowulf] Problems with a JS21 - Ah, the networking...
Ivan Paganini
ispmarin at gmail.com
Sat Sep 29 03:36:42 PDT 2007
Thank you, Bruce, I will try it as soon as I have access to the cluster.
I have already contacted Myricom support, John, and they are working on
it, but there is still no solution to the problem. mx_counters on
the two nodes where I am running the test mpich programs doesn't show
anything unusual:
1 ports
Lanai uptime (seconds): 766268 (0xbb13c)
Counters uptime (seconds): 766268 (0xbb13c)
Bad CRC8 (Port 0): 0 (0x0)
Bad CRC32 (Port 0): 0 (0x0)
Unstripped route (Port 0): 0 (0x0)
pkt_desc_invalid (Port 0): 0 (0x0)
recv_pkt_errors (Port 0): 0 (0x0)
pkt_misrouted (Port 0): 0 (0x0)
data_src_unknown: 0 (0x0)
data_bad_endpt: 0 (0x0)
data_endpt_closed: 0 (0x0)
data_bad_session: 0 (0x0)
push_bad_window: 0 (0x0)
push_duplicate: 0 (0x0)
push_obsolete: 0 (0x0)
push_race_driver: 0 (0x0)
push_bad_send_handle_magic: 0 (0x0)
push_bad_src_magic: 0 (0x0)
pull_obsolete: 0 (0x0)
pull_notify_obsolete: 0 (0x0)
pull_race_driver: 0 (0x0)
pull_notify_race: 0 (0x0)
ack_bad_type: 0 (0x0)
ack_bad_magic: 0 (0x0)
ack_resend_race: 0 (0x0)
Late ack: 0 (0x0)
ack_nack_frames_in_pipe: 0 (0x0)
nack_bad_endpt: 0 (0x0)
nack_endpt_closed: 0 (0x0)
nack_bad_session: 0 (0x0)
nack_bad_rdmawin: 0 (0x0)
nack_eventq_full: 0 (0x0)
send_bad_rdmawin: 0 (0x0)
connect_timeout: 0 (0x0)
connect_src_unknown: 0 (0x0)
query_bad_magic: 0 (0x0)
query_timed_out: 0 (0x0)
query_src_unknown: 0 (0x0)
Raw sends (Port 0): 198711 (0x30837)
Raw receives (Port 0): 84612 (0x14a84)
Raw oversized packets (Port 0): 0 (0x0)
raw_recv_overrun: 0 (0x0)
raw_disabled: 0 (0x0)
connect_send: 698 (0x2ba)
connect_recv: 692 (0x2b4)
ack_send (Port 0): 1361 (0x551)
ack_recv (Port 0): 1353 (0x549)
push_send (Port 0): 306 (0x132)
push_recv (Port 0): 0 (0x0)
query_send (Port 0): 114 (0x72)
query_recv (Port 0): 12 (0xc)
reply_send (Port 0): 12 (0xc)
reply_recv (Port 0): 114 (0x72)
query_unknown (Port 0): 0 (0x0)
query_unknown (Port 0): 0 (0x0)
data_send_null (Port 0): 382 (0x17e)
data_send_small (Port 0): 255 (0xff)
data_send_medium (Port 0): 0 (0x0)
data_send_rndv (Port 0): 18 (0x12)
data_send_pull (Port 0): 0 (0x0)
data_recv_null (Port 0): 434 (0x1b2)
data_recv_small_inline (Port 0): 174 (0xae)
data_recv_small_copy (Port 0): 24 (0x18)
data_recv_medium (Port 0): 19 (0x13)
data_recv_rndv (Port 0): 0 (0x0)
data_recv_pull (Port 0): 54 (0x36)
ether_send_unicast_cnt (Port 0): 15990 (0x3e76)
ether_send_multicast_cnt (Port 0): 10 (0xa)
ether_recv_small_cnt (Port 0): 12205 (0x2fad)
ether_recv_big_cnt (Port 0): 5234 (0x1472)
ether_overrun: 0 (0x0)
ether_oversized: 19 (0x13)
data_recv_no_credits: 0 (0x0)
Packets resent: 0 (0x0)
Packets dropped (data send side): 0 (0x0)
Mapper routes update: 64 (0x40)
Route dispersion (Port 0): 0 (0x0)
out_of_send_handles: 0 (0x0)
out_of_pull_handles: 0 (0x0)
out_of_push_handles: 0 (0x0)
medium_cont_race: 0 (0x0)
cmd_type_unknown: 0 (0x0)
ureq_type_unknown: 0 (0x0)
Interrupts overrun: 0 (0x0)
Waiting for interrupt DMA: 0 (0x0)
Waiting for interrupt Ack: 0 (0x0)
Waiting for interrupt Timer: 0 (0x0)
Slabs recycling: 0 (0x0)
Slabs pressure: 0 (0x0)
Slabs starvation: 0 (0x0)
out_of_rdma handles: 0 (0x0)
eventq_full: 0 (0x0)
buffer_drop (Port 0): 0 (0x0)
memory_drop (Port 0): 0 (0x0)
Hardware flow control (Port 0): 0 (0x0)
(Devel) Simulated packets lost (Port 0): 0 (0x0)
(Logging) Logging frames dumped: 0 (0x0)
Wake interrupts: 629 (0x275)
Averted wakeup race: 326 (0x146)
Dma metadata race: 0 (0x0)
foo: 0 (0x0)
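
A dump this long is hard to check by eye, so here is a minimal sketch for
comparing two saved dumps, for example one taken before and one after a
hang. It assumes every counter line has the form "name: decimal (0xhex)"
as in the output above. The function names are hypothetical and are not
part of MX.

```python
import re

# Matches lines such as "Bad CRC8 (Port 0): 0 (0x0)".
# Assumption: each counter line is "name: decimal (0xhex)", as in the dump above.
COUNTER_RE = re.compile(r"^(?P<name>.+?):\s+(?P<value>\d+)\s+\(0x[0-9a-fA-F]+\)$")


def parse_counters(text):
    """Return {counter name: int value}; lines in any other format are skipped.

    If a name appears twice in the dump, the last occurrence wins.
    """
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


def changed_counters(before, after):
    """Return {name: (old, new)} for counters whose value differs.

    A counter missing from the first snapshot is treated as 0.
    """
    return {
        name: (before.get(name, 0), new)
        for name, new in after.items()
        if before.get(name, 0) != new
    }


if __name__ == "__main__":
    # Two tiny hand-written snapshots in the same format, for illustration only.
    first = parse_counters("Packets resent: 0 (0x0)\nconnect_send: 698 (0x2ba)")
    second = parse_counters("Packets resent: 3 (0x3)\nconnect_send: 698 (0x2ba)")
    for name, (old, new) in sorted(changed_counters(first, second).items()):
        print(f"{name}: {old} -> {new}")
```

In practice the two strings would be the saved output of mx_counters from
the same node at two different times. Error counters that move between
snapshots are the ones worth reporting.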
mx_endpoints shows no connections between any nodes when the
program is not running, and the right number of connections when the
program is running and has simply hung.
I am just waiting for some user programs to finish so that I can stress the
Myrinet, and then try changing the driver from 1.1.6 to 1.2.2.
Thank you.
Ivan
2007/9/29, John Hearns <john.hearns at streamline-computing.com>:
> On Fri, 2007-09-28 at 17:43 -0300, Ivan Paganini wrote:
> > Hello everybody,
> >
> > I am beginning to take care of an IBM's JS21. The cluster consists of
>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs.
> >
> > Any suggestions?
>
> Contact Myricom support?
>
> BTW, if you are doing the debugging by yourself, start from the bottom.
> Take two machines, run mx_info, mx_endpoint (should be nothing if no
> programs running) and mx_counters.
> Then do your pingpong and further stress tests as in the README.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------