[tulip] tulip based cluster on Cisco switch

Brian D. Haymore brian@chpc.utah.edu
Sat, 10 Jun 2000 09:15:18 -0600 (MDT)


Thanks for the tip.  We have been quite attentive to our NICs and their link
state, but in two years of running this cluster we have yet to see many truly
show-stopping problems.  You might want to verify that you have the latest
Cisco software on your switch.

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112-0190

Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366

On Sat, 10 Jun 2000, David Thompson wrote:

> 
> I'll fess up here; we're Michael's customer in this case.  We are also seeing 
> the changed MAC address with the .92 driver (vendor code 00:40:f0 instead of 
> 00:c0:f0).  We put the new MAC address in our DHCP server but still couldn't 
> get a lease with the new driver.  The server is sending out replies, but the 
> client doesn't seem to be getting them.  Caveat for anyone running DHCP with 
> tulip 0.92...
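>
> One quick sanity check is to print the MAC address the running driver is
> actually reporting and compare it against what the DHCP server has on file.
> A rough Python sketch, nothing more (it assumes a Linux box and that the
> interface in question is eth0):
>
>     import fcntl, socket, struct
>
>     SIOCGIFHWADDR = 0x8927  # standard Linux ioctl: get an interface's hardware address
>
>     def mac_address(ifname):
>         # Any socket will do; we only need a file descriptor for the ioctl.
>         s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
>         try:
>             ifreq = struct.pack("256s", ifname.encode()[:15])
>             info = fcntl.ioctl(s.fileno(), SIOCGIFHWADDR, ifreq)
>         finally:
>             s.close()
>         # The six hardware-address bytes sit at offset 18 of the returned ifreq.
>         return ":".join("%02x" % b for b in info[18:24])
>
>     print(mac_address("eth0"))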
> 
> Our sole reason for trying the .92 driver was to see if we could use the 
> 'options=' module parameter to force the whole cluster to 100BaseTX/full 
> duplex.  We have not been able to get these cards to auto-negotiate properly 
> or reliably with the Cisco switch, with either the 'old' or 'new' 
> transceivers, and with either tulip 0.91 or 0.92.  Our hope was to forgo 
> auto-negotiation (which we normally prefer) because it seems to be borked 
> with the hardware we have.  However, all attempts to force speed and duplex 
> with either tulip driver version have failed.
> 
> Brian, you may want to check your 'netstat -i' output from time to time, 
> and/or make a pass through your cluster with tulip-diag to see if all your 
> NICs are truly auto-negotiating properly with the switch(es).  We have seen 
> situations where the auto-negotiation originally succeeds, but a couple of 
> days later we find the switch running 100/full and the card 100/half.  That 
> kind of mismatch does very bad things to network performance.
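>
> For the cluster-wide pass, something along these lines is roughly what I
> mean.  It is only a sketch: it assumes passwordless rsh (swap in ssh if you
> prefer) to hypothetical node names node01..node64, and it flags any eth0
> whose transmit carrier-error rate looks like a duplex mismatch:
>
>     import subprocess
>
>     NODES = ["node%02d" % n for n in range(1, 65)]  # hypothetical node names
>     THRESHOLD = 0.001                               # flag anything above 0.1% carrier errors
>
>     for node in NODES:
>         out = subprocess.run(["rsh", node, "cat", "/proc/net/dev"],
>                              capture_output=True, text=True).stdout
>         for line in out.splitlines():
>             if ":" not in line:
>                 continue                            # skip the two header lines
>             iface, counters = line.split(":", 1)
>             if iface.strip() != "eth0":
>                 continue
>             fields = counters.split()
>             # After the eight receive fields, the transmit fields are:
>             # bytes packets errs drop fifo colls carrier compressed
>             tx_packets = int(fields[9])
>             tx_carrier = int(fields[14])
>             if tx_packets and tx_carrier / tx_packets > THRESHOLD:
>                 print("%s: %d carrier errors in %d tx packets (%.2f%%)"
>                       % (node, tx_carrier, tx_packets,
>                          100.0 * tx_carrier / tx_packets))
>
> Anything that shows up there is worth a closer look with tulip-diag.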
> 
> --
> Dave Thompson  <thomas@cs.wisc.edu>
> 
> Associate Researcher                    Department of Computer Science
> University of Wisconsin-Madison         http://www.cs.wisc.edu/~thomas
> 1210 West Dayton Street                 Phone:    (608)-262-1017
> Madison, WI 53706-1685                  Fax:      (608)-262-6626
> --
> 
> 
> 
> 
> "Brian D. Haymore" wrote:
> >We have a 170-node Beowulf cluster, all using this same network card.  We
> >have found that many versions of the tulip driver produce exactly the
> >results you have seen.  Currently we find version .91 to be the most
> >reliable for us.  Red Hat 6.2 comes with a slightly newer version, and we
> >had issues with that.  We also tried the .92 version in module form from
> >Donald Becker's site and found that this driver somehow got a completely
> >different MAC address for the card (weird, huh!).  We reported that bug and
> >have not heard anything more since.  So as it stands we are still using .91
> >without any issues we can find.  We are also getting up to ~92 Mb/s.  Hope
> >this helps.
> >
> >Michael Immonen wrote:
> >> 
> >> Hi all,
> >> 
> >> We have been struggling with an issue that has taken
> >> many weeks to nearly solve. I am looking for a bit
> >> more advice to bring this to a close.
> >> 
> >> Background:
> >> We have a 100-node cluster, all with Kingston
> >> KNE100TX NICs. It is split into two sets of 64 and
> >> 36 nodes. Each set is attached to its own Cisco
> >> Catalyst 4000.
> >> These systems were seeing several symptoms that we
> >> eventually tied together:
> >> 1. Ridiculously slow data transfer speeds: with the
> >> nodes and switch configured for 100Mbps, data was
> >> transferring at well below 10Mbps (the actual value
> >> varied).
> >> 2. Severe packet loss due to carrier errors, as
> >> reported in /proc/net/dev. Again highly variable: on
> >> poorly performing nodes, anywhere from 0.10% to
> >> 94.00% of transmitted packets had carrier errors.
> >> 3. All affected nodes were found to be operating at
> >> half duplex while the switch and the good nodes were
> >> at full duplex. This was discovered using
> >> tulip-diag.c.
> >> 
> >> We thought we had a final solution when Kingston
> >> helped us track down a known hardware issue related
> >> to some versions of the SEEQ MII transceiver. They
> >> informed us that, under Linux and with some switches,
> >> several versions of the SEEQ chip had intermittent
> >> "timing issues". The SEEQ (or LSI) 80223/C was a
> >> known-good chip, but the SEEQ 80220/G and 80223/B
> >> would sometimes display this behavior. The tricky
> >> part is that in some cases they were perfectly fine.
> >> Kingston did an excellent job assisting us with the
> >> replacement of all 100 NICs.
> >> 
> >> After all cards were swapped and the cluster was
> >> again up and running, everything was beautiful: 100
> >> FD all around. End of issue, or so we thought.
> >> 
> >> Nearly a week later, on checking the systems, 16
> >> nodes across the two sets were found to be back in
> >> half duplex (but with MUCH lower carrier errors:
> >> 0.01% to 0.09%). A couple of days after that, the
> >> whole cluster was reported to be at HD.
> >> 
> >> All systems had been running kernel 2.2.12 with
> >> tulip.c v0.91; one system was updated to 2.2.15 with
> >> v0.92, but this did not solve the issue.
> >> 
> >> I have spent some time scanning the tulip lists and
> >> have gained some information there, but now also
> >> have some more questions...
> >> 
> >> Now, for my questions:
> >> Why would a running system renegotiate its network
> >> settings without user intervention?
> >> 
> >> I am assuming that the current problem has to do
> >> with the Cisco switch issue that Donald Becker
> >> mentioned on April 27, 2000 on this list. If that is
> >> the case, would another type of Ethernet chip still
> >> experience the same problems?
> >> 
> >> Donald Becker has stated that forcing speed and duplex
> >> is not recommended. Could forcing these cards
> >> to 100 FD for this issue be a safe solution?
> >> I have read that using HD is better. Why?
> >> Should the nodes be forced to 100 HD?
> >> 
> >> Does anyone have any other recommendations or advice?
> >> Maybe a recommended switch to replace the Cisco
> >> switches?
> >> 
> >> Regards,
> >> Michael
> >> 
> >
> >--
> >Brian D. Haymore
> >University of Utah
> >Center for High Performance Computing
> >155 South 1452 East RM 405
> >Salt Lake City, Ut 84112-0190
> >
> >Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366
> >
> 
>