[tulip] tulip based cluster on Cisco switch

Brian D. Haymore brian@chpc.utah.edu
Fri, 09 Jun 2000 14:54:17 -0600


We have a 170 node Beowulf cluster all using this same network card.  We
have found that many versions of the tulip driver produce exactly the
results you have seen.  Currently we find version .91 to be the most
reliable for us.  Red Hat 6.2 ships a slightly newer version, and we had
issues with that.  We also tried the .92 version in module form from
Donald Becker's site and found that this driver somehow came up with a
completely different MAC address for the card (weird, huh!).  We
reported that bug and have not heard anything more beyond that.  So as
it stands we are still using .91 without any issues we can find.  We are
also getting up to ~92 Mb/s.  Hope this helps.
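
If you want to sanity-check what MAC address a given driver version is
reporting, a few lines of C against the standard SIOCGIFHWADDR ioctl
will do it.  This is only a sketch; the eth0 default below is an
assumption, so pass your real interface name:

/* macaddr.c - print the MAC address the driver reports for an interface.
 * Sketch only.  Build: gcc -o macaddr macaddr.c
 * Usage: ./macaddr eth0   (the eth0 default is an assumption)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";
    struct ifreq ifr;
    unsigned char *hw;
    int fd;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    /* Ask the kernel (and therefore the driver) for the hardware address. */
    if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) {
        perror("SIOCGIFHWADDR");
        return 1;
    }

    hw = (unsigned char *)ifr.ifr_hwaddr.sa_data;
    printf("%s: %02x:%02x:%02x:%02x:%02x:%02x\n", ifname,
           hw[0], hw[1], hw[2], hw[3], hw[4], hw[5]);
    close(fd);
    return 0;
}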

Michael Immonen wrote:
> 
> Hi all,
> 
> We have been struggling with an issue that has taken
> many weeks to nearly solve. I am looking for a bit
> more advice to bring this to a close.
> 
> Background:
> We have a 100 node cluster, all with Kingston
> KNE100TX NICs. It is split into two sets of 64 and
> 36 nodes. Each set is attached to its own Cisco
> Catalyst 4000.
> These systems were seeing several symptoms that we
> eventually tied together:
> 1. Ridiculously slow data transfer speeds: with the
> nodes and the switch configured for 100Mbps, data was
> transferring at well below 10Mbps, and the actual
> value varied.
> 2. Severe packet loss due to carrier errors, as
> reported in /proc/net/dev. Again highly variable:
> anywhere from 0.10% to 94.00% of transmitted packets
> on the poorly performing nodes had carrier errors.
> (A quick way to compute that ratio is sketched just
> after this list.)
> 3. All affected nodes were found to be operating in
> half duplex, while the switch and the good nodes were
> in full duplex. This was discovered using
> tulip-diag.c.
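
A quick way to quantify the carrier-error ratio from item 2 is to pull
the transmit counters straight out of /proc/net/dev.  The sketch below
assumes the 16-field layout of 2.2-era and later kernels (cross-check
against the header lines on your systems) and defaults to eth0, which
is just an assumed interface name:

/* carrier.c - report carrier errors as a percentage of transmitted packets.
 * Minimal sketch; assumes the 16-field /proc/net/dev layout of 2.2+ kernels.
 * Build: gcc -o carrier carrier.c    Run: ./carrier eth0
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";
    char line[512], tag[64];
    FILE *fp = fopen("/proc/net/dev", "r");

    if (fp == NULL) { perror("/proc/net/dev"); return 1; }
    snprintf(tag, sizeof(tag), "%s:", ifname);

    while (fgets(line, sizeof(line), fp) != NULL) {
        char *p = strstr(line, tag);
        unsigned long rx[8], tx[8];

        if (p == NULL)
            continue;
        /* Fields after "ethN:": 8 receive counters, then 8 transmit
         * counters (bytes packets errs drop fifo colls carrier compressed). */
        if (sscanf(p + strlen(tag),
                   "%lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                   &rx[0], &rx[1], &rx[2], &rx[3], &rx[4], &rx[5], &rx[6], &rx[7],
                   &tx[0], &tx[1], &tx[2], &tx[3], &tx[4], &tx[5], &tx[6], &tx[7]) == 16
            && tx[1] > 0)
            printf("%s: %lu carrier errors / %lu tx packets = %.2f%%\n",
                   ifname, tx[6], tx[1], 100.0 * tx[6] / tx[1]);
    }
    fclose(fp);
    return 0;
}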
> 
> We thought we had a final solution when Kingston
> assisted us in tracking a known hardware issue
> related to some versions of the SEEQ MII transceiver.
> They informed us that, under Linux, and with
> some switches, there were several versions of the SEEQ
> chip that had intermittent "timing issues". The
> SEEQ (or LSI) 80223/C were known good chips, but the
> SEEQ 80220/G and 80223/B would sometimes display this
> behavior. The tricky part is that in some cases, they
> were perfectly fine. Kingston did an excellent job
> assisting us with the replacement of all 100 NICs.
> 
> After all cards were swapped and the cluster was again
> up and running, everything was beautiful- 100
> FD all around. End of issue, or so we thought.
> 
> Nearly a week later, on checking the systems, 16 nodes
> between the two sets were discovered to be in half
> duplex again (but with MUCH lower carrier errors, only
> 0.01% to 0.09%). And just a couple of days after that,
> the whole cluster was reported to be at HD.
> 
> All systems had been running kernel 2.2.12 with
> tulip.c v0.91; one system was updated to 2.2.15 with
> v0.92, but this did not solve the issue.
> 
> I have spent some time scanning the tulip lists and
> have gained some information there, but now also
> have some more questions...
> 
> Now, for my questions:
> Why would a running system renegotiate its network
> settings without user intervention?
> 
> I am assuming that the current problem has to do with
> the Cisco switch issue that Donald Becker
> mentioned on April 27, 2000 on this list. If this is
> the case, would another type of ethernet chip still
> experience the same problems?
> 
> Donald Becker has stated that forcing speed and duplex
> is not recommended. Could forcing these cards
> to 100 FD for this issue be a safe solution?
> I have read that using HD is better. Why?
> Should the nodes be forced to 100 HD?
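
Before (or after) forcing anything, it is worth confirming what the
card and the switch actually negotiated.  tulip-diag reads the MII
registers directly; on kernels and drivers that implement the standard
MII ioctls (SIOCGMIIPHY/SIOCGMIIREG - the 2.2.12/tulip 0.91 combination
described above may well not), a small program can read the link
partner ability register instead.  Again only a sketch, with eth0 as an
assumed interface name:

/* lpa.c - print what the link partner advertised during autonegotiation.
 * Sketch only; requires a driver that implements the standard MII ioctls
 * (SIOCGMIIPHY/SIOCGMIIREG) - older tulip drivers may not, in which case
 * tulip-diag is the right tool.
 * Build: gcc -o lpa lpa.c    Run: ./lpa eth0
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/mii.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";
    struct ifreq ifr;
    struct mii_ioctl_data *mii = (struct mii_ioctl_data *)&ifr.ifr_data;
    unsigned lpa;
    int fd;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    /* Ask the driver which PHY address it is using. */
    if (ioctl(fd, SIOCGMIIPHY, &ifr) < 0) { perror("SIOCGMIIPHY"); return 1; }

    /* MII register 5: link partner ability (what the switch advertised). */
    mii->reg_num = MII_LPA;
    if (ioctl(fd, SIOCGMIIREG, &ifr) < 0) { perror("SIOCGMIIREG"); return 1; }
    lpa = mii->val_out;

    /* If the partner is not autonegotiating (e.g. duplex forced on the
     * switch), this register is typically 0 and tells you nothing. */
    printf("%s link partner: %s%s%s%s(raw 0x%04x)\n", ifname,
           (lpa & LPA_100FULL) ? "100baseTx-FD " : "",
           (lpa & LPA_100HALF) ? "100baseTx-HD " : "",
           (lpa & LPA_10FULL)  ? "10baseT-FD "   : "",
           (lpa & LPA_10HALF)  ? "10baseT-HD "   : "",
           lpa);
    close(fd);
    return 0;
}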
> 
> Does anyone have any other recommendations or advice?
> Maybe a recommended switch to replace the Cisco
> switches?
> 
> Regards,
> Michael
> 

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112-0190

Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366