[Beowulf] Problems with a JS21 - Ah, the networking...

Ivan Paganini ispmarin at gmail.com
Sat Sep 29 03:36:42 PDT 2007


Thank you, Bruce, I will try as soon as I have access to the cluster.

I already contacted Myricom support, John, and they are working on
it, but there is still no solution to the problem. mx_counters on the
two nodes where I am running the mpich test programs doesn't show
anything unusual:

1 ports
            Lanai uptime (seconds):     766268 (0xbb13c)
         Counters uptime (seconds):     766268 (0xbb13c)
                 Bad CRC8 (Port 0):          0 (0x0)
                Bad CRC32 (Port 0):          0 (0x0)
         Unstripped route (Port 0):          0 (0x0)
         pkt_desc_invalid (Port 0):          0 (0x0)
          recv_pkt_errors (Port 0):          0 (0x0)
            pkt_misrouted (Port 0):          0 (0x0)
                  data_src_unknown:          0 (0x0)
                    data_bad_endpt:          0 (0x0)
                 data_endpt_closed:          0 (0x0)
                  data_bad_session:          0 (0x0)
                   push_bad_window:          0 (0x0)
                    push_duplicate:          0 (0x0)
                     push_obsolete:          0 (0x0)
                  push_race_driver:          0 (0x0)
        push_bad_send_handle_magic:          0 (0x0)
                push_bad_src_magic:          0 (0x0)
                     pull_obsolete:          0 (0x0)
              pull_notify_obsolete:          0 (0x0)
                  pull_race_driver:          0 (0x0)
                  pull_notify_race:          0 (0x0)
                      ack_bad_type:          0 (0x0)
                     ack_bad_magic:          0 (0x0)
                   ack_resend_race:          0 (0x0)
                          Late ack:          0 (0x0)
           ack_nack_frames_in_pipe:          0 (0x0)
                    nack_bad_endpt:          0 (0x0)
                 nack_endpt_closed:          0 (0x0)
                  nack_bad_session:          0 (0x0)
                  nack_bad_rdmawin:          0 (0x0)
                  nack_eventq_full:          0 (0x0)
                  send_bad_rdmawin:          0 (0x0)
                   connect_timeout:          0 (0x0)
               connect_src_unknown:          0 (0x0)
                   query_bad_magic:          0 (0x0)
                   query_timed_out:          0 (0x0)
                 query_src_unknown:          0 (0x0)
                Raw sends (Port 0):     198711 (0x30837)
             Raw receives (Port 0):      84612 (0x14a84)
    Raw oversized packets (Port 0):          0 (0x0)
                  raw_recv_overrun:          0 (0x0)
                      raw_disabled:          0 (0x0)
                      connect_send:        698 (0x2ba)
                      connect_recv:        692 (0x2b4)
                 ack_send (Port 0):       1361 (0x551)
                 ack_recv (Port 0):       1353 (0x549)
                push_send (Port 0):        306 (0x132)
                push_recv (Port 0):          0 (0x0)
               query_send (Port 0):        114 (0x72)
               query_recv (Port 0):         12 (0xc)
               reply_send (Port 0):         12 (0xc)
               reply_recv (Port 0):        114 (0x72)
            query_unknown (Port 0):          0 (0x0)
            query_unknown (Port 0):          0 (0x0)
           data_send_null (Port 0):        382 (0x17e)
          data_send_small (Port 0):        255 (0xff)
         data_send_medium (Port 0):          0 (0x0)
           data_send_rndv (Port 0):         18 (0x12)
           data_send_pull (Port 0):          0 (0x0)
           data_recv_null (Port 0):        434 (0x1b2)
   data_recv_small_inline (Port 0):        174 (0xae)
     data_recv_small_copy (Port 0):         24 (0x18)
         data_recv_medium (Port 0):         19 (0x13)
           data_recv_rndv (Port 0):          0 (0x0)
           data_recv_pull (Port 0):         54 (0x36)
   ether_send_unicast_cnt (Port 0):      15990 (0x3e76)
 ether_send_multicast_cnt (Port 0):         10 (0xa)
     ether_recv_small_cnt (Port 0):      12205 (0x2fad)
       ether_recv_big_cnt (Port 0):       5234 (0x1472)
                     ether_overrun:          0 (0x0)
                   ether_oversized:         19 (0x13)
              data_recv_no_credits:          0 (0x0)
                    Packets resent:          0 (0x0)
  Packets dropped (data send side):          0 (0x0)
              Mapper routes update:         64 (0x40)
         Route dispersion (Port 0):          0 (0x0)
               out_of_send_handles:          0 (0x0)
               out_of_pull_handles:          0 (0x0)
               out_of_push_handles:          0 (0x0)
                  medium_cont_race:          0 (0x0)
                  cmd_type_unknown:          0 (0x0)
                 ureq_type_unknown:          0 (0x0)
                Interrupts overrun:          0 (0x0)
         Waiting for interrupt DMA:          0 (0x0)
         Waiting for interrupt Ack:          0 (0x0)
       Waiting for interrupt Timer:          0 (0x0)
                   Slabs recycling:          0 (0x0)
                    Slabs pressure:          0 (0x0)
                  Slabs starvation:          0 (0x0)
               out_of_rdma handles:          0 (0x0)
                       eventq_full:          0 (0x0)
              buffer_drop (Port 0):          0 (0x0)
              memory_drop (Port 0):          0 (0x0)
    Hardware flow control (Port 0):          0 (0x0)
(Devel) Simulated packets lost (Port 0):          0 (0x0)
   (Logging) Logging frames dumped:          0 (0x0)
                   Wake interrupts:        629 (0x275)
               Averted wakeup race:        326 (0x146)
                 Dma metadata race:          0 (0x0)
                               foo:          0 (0x0)
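
For anyone who wants to compare dumps like the one above before and
after a test run, here is a minimal sketch in Python. It assumes only
the "name: decimal (0xhex)" line layout shown above. The function names
are made up for illustration and are not part of the MX tools. The
dumps are assumed to have been saved to text files beforehand.

```python
import re

# Matches lines such as "   Bad CRC8 (Port 0):          0 (0x0)".
# Lines without this "name: decimal (0xhex)" shape (e.g. "1 ports") are skipped.
LINE_RE = re.compile(
    r"^\s*(?P<name>.+?):\s+(?P<value>\d+)\s+\(0x[0-9a-fA-F]+\)\s*$"
)


def parse_counters(text):
    """Turn a saved counters dump into a {name: integer value} dict.

    If a counter name appears twice in the dump, the last value wins.
    """
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


def changed_counters(before, after):
    """Return {name: delta} for every counter whose value differs."""
    deltas = {}
    for name, value in after.items():
        delta = value - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas


def nonzero(counters, names):
    """Return the subset of the given counter names that are not zero."""
    return {n: counters[n] for n in names if counters.get(n, 0) != 0}
```

Which counters are worth watching is a judgment call. Passing names
such as "Bad CRC32 (Port 0)" or "Packets resent" to nonzero() is my own
guess at a starting point, not something from the MX documentation.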

mx_endpoints shows no connections between any nodes when the program
is not running, and the right number of connections when the program
is running but has hung.

I am just waiting for some user programs to finish so that I can
stress the Myrinet and try changing the driver from 1.1.6 to 1.2.2.

Thank you.

Ivan

2007/9/29, John Hearns <john.hearns at streamline-computing.com>:
> On Fri, 2007-09-28 at 17:43 -0300, Ivan Paganini wrote:
> > Hello everybody,
> >
> > I am beginning to take care of an IBM's JS21. The cluster consists of
>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs.
> >
> > Any suggestions?
>
> Contact Myricom support?
>
> BTW, if you are doing the debugging by yourself, start from the bottom.
> Take two machines, run mx_info, mx_endpoint (should be nothing if no
> programs running) and mx_counters.
> Then do your pingpong and further stress tests as in the README.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


-- 
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------


