[tulip] tulip based cluster on Cisco switch
Brian D. Haymore
brian@chpc.utah.edu
Sat, 10 Jun 2000 23:38:26 -0600 (MDT)
Most of our NICs are 21143-PD parts and we see almost no issues with the .91
driver, so I don't think they are hopeless at all, just stubborn.
--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112-0190
Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366
On Sat, 10 Jun 2000, Homer Wilson Smith wrote:
>
> Also verify whether you are using 21143 or 21140 chips.
>
> The 21140s are well behaved as far as I can tell; the 21143s
> are hopeless.
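>
> (If you're not sure which chip you have: on a 2.2 kernel the chip name
> shows up in /proc/pci, and tulip-diag will also print the chip type it
> detects. For example
>
>     grep -i ethernet /proc/pci
>
> should give you a line naming the DECchip, e.g. DC21140 or DC21142/43.)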
>
> Homer
>
> ------------------------------------------------------------------------
> Homer Wilson Smith Clear Air, Clear Water, Art Matrix - Lightlink
> (607) 277-0959 A Green Earth and Peace. Internet Access, Ithaca NY
> homer@lightlink.com Is that too much to ask? http://www.lightlink.com
>
> On Sat, 10 Jun 2000, Brian D. Haymore wrote:
>
> > Thanks for the tip. We have been quite attentive to our NICs and their
> > link state, but after two years of life on this cluster we have yet to
> > see many real show-stopping problems. You might want to verify
> > that you have the latest Cisco software on your switch.
> >
> > --
> > Brian D. Haymore
> > University of Utah
> > Center for High Performance Computing
> > 155 South 1452 East RM 405
> > Salt Lake City, Ut 84112-0190
> >
> > Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366
> >
> > On Sat, 10 Jun 2000, David Thompson wrote:
> >
> > >
> > > I'll fess up here; we're Michael's customer in this case. We are also seeing
> > > the changed MAC address with the .92 driver (vendor code 00:40:f0 instead of
> > > 00:c0:f0). We put the new MAC address in our DHCP server but still couldn't
> > > get a DHCP lease with the new driver. The server is sending out replies, but
> > > the client doesn't seem to be getting them. Caveat for anyone running DHCP
> > > with tulip 0.92...
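> > >
> > > (For anyone making the same change: it amounts to updating the hardware
> > > ethernet line in each node's host entry. The example below assumes ISC
> > > dhcpd; the node name, IP, and low MAC bytes are placeholders:)
> > >
> > >     host node01 {
> > >         # vendor prefix as reported by the 0.92 driver
> > >         hardware ethernet 00:40:f0:aa:bb:cc;
> > >         fixed-address 192.168.1.101;
> > >     }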
> > >
> > > Our sole reason for trying the .92 driver was to see if we could use the
> > > 'options=' goo to force the whole cluster to 100BaseTX/fdx. We have not been
> > > able to get these cards to auto-negotiate properly and/or reliably with the
> > > Cisco switch with either the 'old' or 'new' transceivers, with either tulip
> > > 0.91 or 0.92. Our hope was to forgo auto-negotiation (which we normally
> > > prefer) because it seems to be borked with the hardware we have. However, all
> > > attempts to force speed and duplex with either tulip driver version have failed.
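> > >
> > > (For reference, the attempt amounted to something like the following in
> > > /etc/conf.modules. The media index should be double-checked against the
> > > table at the top of tulip.c, but if I recall correctly 5 is
> > > 100baseTx-FDX:)
> > >
> > >     alias eth0 tulip
> > >     # try to lock the card at 100baseTx full duplex instead of
> > >     # autonegotiating; as noted above, this did not work for us
> > >     options tulip options=5 full_duplex=1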
> > >
> > > Brian, you may want to check your 'netstat -i' from time to time, and/or make
> > > a pass through your cluster with tulip-diag to see if all your NICs are truly
> > > auto-negotiating properly with the switch(es). We have seen situations where
> > > the auto-negotiation originally succeeds, but a couple of days later we find
> > > the switch running 100/full and the card 100/half. This causes bad things to
> > > happen wrt network performance.
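> > >
> > > (Something along these lines is what I mean -- it assumes rsh access to
> > > the nodes, tulip-diag on each one, and a nodeNN naming scheme; adjust
> > > the grep to whatever your tulip-diag build prints for the duplex state:)
> > >
> > >     # report the negotiated duplex for every node in the cluster
> > >     for n in `seq 1 170`; do
> > >         echo -n "node$n: "
> > >         rsh node$n tulip-diag | grep -i duplex
> > >     done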
> > >
> > > --
> > > Dave Thompson <thomas@cs.wisc.edu>
> > >
> > > Associate Researcher Department of Computer Science
> > > University of Wisconsin-Madison http://www.cs.wisc.edu/~thomas
> > > 1210 West Dayton Street Phone: (608)-262-1017
> > > Madison, WI 53706-1685 Fax: (608)-262-6626
> > > --
> > >
> > >
> > >
> > >
> > > "Brian D. Haymore" wrote:
> > > >We have a 170-node Beowulf cluster all using this same network card. We
> > > >have found that many versions of the tulip driver produce the exact
> > > >results you have seen. Currently we find version .91 to be the most
> > > >reliable for us. Red Hat 6.2 comes with a slightly newer version and we
> > > >had issues with that. We also tried the .92 version in module form from
> > > >Donald Becker's site and found that this driver somehow got a completely
> > > >different MAC address for the card (weird, huh!). We reported that bug
> > > >and have not heard anything more beyond that. So as it is we are still
> > > >using .91 without any issues we can find. We also are getting up to
> > > >~92 Mb/s. Hope this helps.
> > > >
> > > >Michael Immonen wrote:
> > > >>
> > > >> Hi all,
> > > >>
> > > >> We have been struggling with an issue that has taken
> > > >> many weeks to nearly solve. I am looking for a bit
> > > >> more advice to bring this to a close.
> > > >>
> > > >> Background:
> > > >> We have a 100-node cluster, all with Kingston
> > > >> KNE100TX NICs. It is split into two sets of 64 and
> > > >> 36 nodes. Each set is attached to its own Cisco
> > > >> Catalyst 4000.
> > > >> These systems were seeing several symptoms that we
> > > >> eventually tied together:
> > > >> 1. Ridiculously slow data transfer speeds: with the
> > > >> nodes and switch configured for 100Mbps, data was
> > > >> transferring at well below 10Mbps, with the actual
> > > >> value varying from node to node.
> > > >> 2. Severe packet loss due to carrier errors, as
> > > >> reported in /proc/net/dev (see the snippet below).
> > > >> Again highly variable: on poorly performing nodes,
> > > >> anywhere from 0.10% to 94.00% of transmitted
> > > >> packets had carrier errors.
> > > >> 3. All affected nodes were found to be operating in
> > > >> half duplex while the switch and the good nodes
> > > >> were in full duplex. This was discovered using
> > > >> tulip-diag.c.
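> > > >>
> > > >> (For anyone who wants to check their own nodes,
> > > >> something like the following pulls the carrier-error
> > > >> rate out of /proc/net/dev; the column positions are
> > > >> those of a 2.2 kernel:)
> > > >>
> > > >>     # tx carrier errors as a percentage of tx
> > > >>     # packets on eth0
> > > >>     awk '/eth0:/ { sub(/.*:/, ""); split($0, f)
> > > >>         if (f[10]) printf "%.2f%%\n", 100*f[15]/f[10]
> > > >>     }' /proc/net/dev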
> > > >>
> > > >> We thought we had a final solution when Kingston
> > > >> assisted us in tracking down a known hardware issue
> > > >> related to some versions of the SEEQ MII transceiver.
> > > >> They informed us that, under Linux, and with
> > > >> some switches, there were several versions of the SEEQ
> > > >> chip that had intermittent "timing issues". The
> > > >> SEEQ (or LSI) 80223/C were known good chips, but the
> > > >> SEEQ 80220/G and 80223/B would sometimes display this
> > > >> behavior. The tricky part is that in some cases, they
> > > >> were perfectly fine. Kingston did an excellent job
> > > >> assisting us with the replacement of all 100 NICs.
> > > >>
> > > >> After all cards were swapped and the cluster was again
> > > >> up and running, everything was beautiful: 100 FD
> > > >> all around. End of issue, or so we thought.
> > > >>
> > > >> Nearly a week later, on checking the systems, 16 nodes
> > > >> between the two sets were discovered to be in half
> > > >> duplex again (but with MUCH lower carrier errors:
> > > >> 0.01% to 0.09%). Just a couple of days later the
> > > >> whole cluster was reported to be at HD.
> > > >>
> > > >> All systems had been running kernel 2.2.12 with
> > > >> tulip.c v0.91; one system was updated to 2.2.15 with
> > > >> v0.92, but this did not solve the issue.
> > > >>
> > > >> I have spent some time scanning the tulip lists and
> > > >> have gained some information there, but now also
> > > >> have some more questions...
> > > >>
> > > >> Now, for my questions:
> > > >> Why would a running system renegotiate its network
> > > >> settings without user intervention?
> > > >>
> > > >> I am assuming that the current problem has to do with
> > > >> the Cisco switch issue that Donald Becker
> > > >> mentioned on April 27, 2000 on this list. If this is
> > > >> the case, would another type of Ethernet chip still
> > > >> experience the same problems?
> > > >>
> > > >> Donald Becker has stated that forcing speed and duplex
> > > >> is not recommended. Could forcing these cards
> > > >> to 100 FD for this issue be a safe solution?
> > > >> I have read that using HD is better. Why?
> > > >> Should the nodes be forced to 100 HD?
> > > >>
> > > >> Does anyone have any other recommendations or advice?
> > > >> Maybe a recommended switch to replace the Cisco
> > > >> switches?
> > > >>
> > > >> Regards,
> > > >> Michael
> > > >>
> > > >
> > > >--
> > > >Brian D. Haymore
> > > >University of Utah
> > > >Center for High Performance Computing
> > > >155 South 1452 East RM 405
> > > >Salt Lake City, Ut 84112-0190
> > > >
> > > >Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366
> > > >
> > >
> > >
> >
> >
> > _______________________________________________
> > tulip mailing list
> > tulip@scyld.com
> > http://www.scyld.com/mailman/listinfo/tulip
> >
>