[tulip] tulip-based cluster on Cisco switch

Michael Immonen mlimmonen@yahoo.com
Fri, 9 Jun 2000 13:28:23 -0700 (PDT)


Hi all,

We have been struggling with an issue that has taken
many weeks to nearly solve. I am looking for a bit 
more advice to bring this to a close.

Background:
We have a 100-node cluster, all with Kingston
KNE100TX NICs. It is split into two sets of 64 and
36 nodes. Each set is attached to its own Cisco
Catalyst 4000.
These systems were seeing several symptoms that we
eventually tied together:
1. Ridiculously slow data transfer speeds: with the
nodes and switch configured for 100Mbps, data was
transferring at well below 10Mbps (the actual rate
varied).
2. Severe packet loss due to carrier errors, as
reported in /proc/net/dev. Again highly variable: on
poorly performing nodes, anywhere from 0.10% to 94.00%
of transmitted packets had carrier errors. (A short
parsing sketch follows this list.)
3. All affected nodes were discovered to be operating
in half duplex, while the switch and the good nodes
were in full duplex. This was discovered using
tulip-diag.c.
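
For reference, here is a minimal sketch (in Python) of how
that carrier-error percentage can be computed from
/proc/net/dev. It assumes the usual field layout, where the
transmit columns are bytes, packets, errs, drop, fifo,
colls, carrier, compressed; adjust the indices if your
kernel reports something different.

#!/usr/bin/env python
# Report the percentage of transmitted packets with carrier
# errors, per interface, by parsing /proc/net/dev.

def carrier_error_rates(path="/proc/net/dev"):
    rates = {}
    with open(path) as f:
        lines = f.readlines()[2:]        # skip the two header lines
    for line in lines:
        name, sep, data = line.partition(":")
        if not sep:
            continue
        fields = data.split()
        if len(fields) < 16:             # unexpected layout, skip
            continue
        tx_packets = int(fields[9])      # transmitted packets
        tx_carrier = int(fields[14])     # transmit carrier errors
        if tx_packets:
            rates[name.strip()] = 100.0 * tx_carrier / tx_packets
    return rates

if __name__ == "__main__":
    for iface, pct in sorted(carrier_error_rates().items()):
        print("%-8s %6.2f%% of tx packets had carrier errors" % (iface, pct))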

We thought we had a final solution when Kingston
assisted us in tracking down a known hardware issue
related to some versions of the SEEQ MII transceiver.
They informed us that, under Linux and with some
switches, several versions of the SEEQ chip had
intermittent "timing issues". The SEEQ (or LSI) 80223/C
was a known good chip, but the SEEQ 80220/G and 80223/B
would sometimes display this behavior. The tricky part
is that in some cases they were perfectly fine.
Kingston did an excellent job assisting us with the
replacement of all 100 NICs.

After all cards were swapped and the cluster was again
up and running, everything was beautiful: 100 FD all
around. End of issue, or so we thought.

Nearly a week later, on checking the systems, 16 nodes
across the two sets were discovered to be back in half
duplex (but with MUCH lower carrier error rates: 0.01%
to 0.09%). Just a couple of days later, the whole
cluster was reported to be in HD.

All systems had been running kernel 2.2.12 with
tulip.c v0.91; one system was updated to 2.2.15 with
v0.92, but this did not solve the issue.

I have spent some time scanning the tulip lists and
have gained some information there, but I now have
some further questions...

Now, for my questions:
Why would a running system renegotiate its network
settings without user intervention?

I am assuming that the current problem has to do with
the Cisco switch issue that Donald Becker mentioned on
this list on April 27, 2000. If this is the case, would
another type of Ethernet chip still experience the same
problems?

Donald Becker has stated that forcing speed and duplex
is not recommended. Could forcing these cards 
to 100 FD for this issue be a safe solution?
I have read that using HD is better. Why?
Should the nodes be forced to 100 HD?
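
As an aside, before forcing anything it would be useful
to see what each side actually advertised and negotiated.
Below is a rough Python sketch of reading the MII
advertisement registers through the generic
SIOCGMIIPHY/SIOCGMIIREG ioctls; this assumes a kernel and
driver that implement those ioctls, which our tulip v0.91
may not, in which case tulip-diag remains the way to get
at the same registers.

import fcntl
import socket
import struct

SIOCGMIIPHY = 0x8947   # ask the driver which PHY address it uses
SIOCGMIIREG = 0x8948   # read an MII register on that PHY
MII_ANAR    = 0x04     # our auto-negotiation advertisement
MII_ANLPAR  = 0x05     # link partner's advertisement

def mii_read(ifname, reg):
    """Read one MII register on ifname (needs root and MII ioctl support)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # struct ifreq: 16-byte name followed by struct mii_ioctl_data
    # { u16 phy_id; u16 reg_num; u16 val_in; u16 val_out; }
    req = struct.pack("16sHHHH", ifname.encode(), 0, 0, 0, 0)
    req = fcntl.ioctl(s.fileno(), SIOCGMIIPHY, req)    # fills in phy_id
    phy_id = struct.unpack("16sHHHH", req)[1]
    req = struct.pack("16sHHHH", ifname.encode(), phy_id, reg, 0, 0)
    req = fcntl.ioctl(s.fileno(), SIOCGMIIREG, req)
    s.close()
    return struct.unpack("16sHHHH", req)[4]             # val_out

if __name__ == "__main__":
    adv = mii_read("eth0", MII_ANAR)
    lpa = mii_read("eth0", MII_ANLPAR)
    # bit 0x0100 = 100baseTx full duplex, 0x0080 = 100baseTx half duplex
    print("we advertise 100FD:       %s" % bool(adv & 0x0100))
    print("partner advertises 100FD: %s" % bool(lpa & 0x0100))

If both ends advertise 100FD but the link keeps falling
back to half duplex, that would point at the negotiation
process itself rather than the advertisement.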

Does anyone have any other recommendations or advice?
Maybe a recommended switch to replace the Cisco
switches?

Regards,
Michael
