[eepro100] Transmitter Timeout

Fri, 30 Jun 2000 21:32:14 GMT

I have just finished reading the archives re: what appears to be a
rather frustrating issue (Transmitter Timeout). The fact that I was
reading the archive should be clue enough that it has raised its head
here as well. I wanted to pass on some info to the list and see if it
helps any of you working the issue.
We have two 4-node Linux clusters built from Dell PowerEdge
2300/2400s: dual 500 MHz, PERCII/SC RAID, 1 GB RAM, and two 82557s in
every box. The first cluster in the US has NEVER seen the timeout
problem and has been operational for over a year now. However, our
most recent deployment of an identical cluster in Asia is seeing it on
a regular basis. All systems currently have an in-house compiled
2.2.14 SMP kernel and are using eepro100.c v1.06 as a loadable
module. These have been diff'ed many times to verify that everything
is the same everywhere.
The two main differences I have identified are:
First, the working cluster talks to Cisco gear while the other talks to
3Com gear. To get everything working properly in the States (Cisco
gear) we are disabling auto-negotiation and forcing 100 Mbit full
duplex (options=0x30,0x30 in conf.modules). We are doing the same in
Asia, but this does not appear to be helping.
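For anyone checking the options value: as far as I can tell from the
driver source, it is a bitmask where 0x20 forces 100 Mbit and 0x10
forces full duplex, so 0x30 sets both. Treat that bit layout as my
assumption and verify it against your own copy of eepro100.c. A minimal
shell sketch that decodes a value (decode_opt is just an illustrative
name):

```shell
#!/bin/sh
# Decode an eepro100 "options" value into forced speed/duplex flags.
# ASSUMPTION: 0x20 = force 100 Mbit, 0x10 = force full duplex.
# Verify the bit meanings against the eepro100.c you actually run.
decode_opt() {
    opt=$(( $1 ))                      # accepts hex such as 0x30
    speed="speed-not-forced"
    duplex="duplex-not-forced"
    [ $(( opt & 0x20 )) -ne 0 ] && speed="100mbit"
    [ $(( opt & 0x10 )) -ne 0 ] && duplex="full-duplex"
    echo "$speed $duplex"
}

decode_opt 0x30    # prints: 100mbit full-duplex
```

The two comma-separated values in options=0x30,0x30 apply to the first
and second card respectively, so each one can be checked on its own.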
Second, the hardware in the States is slightly older, Dell 2300s with
a 32-bit PCI backplane, while the hardware in Asia is the newer Dell
2400s, which have both 32-bit and 64-bit PCI slots. The Intel 82557s
are in the 32-bit slots.
The interesting thing is that eth0 (which goes to a 3Com switch and
then into the core) has never had the problem in Asia, while eth1,
which goes directly to the core and is configured as a private VLAN
for inter-box communication, is seeing the problem. (Note: I am not
completely familiar with the details of this configuration; I am
repeating what the networking guys have said.)
My biggest problem is that I have not been able to find a sufficient
workaround. Ifup/ifdown does basically nothing: the TX error counters
continue to show the same error count after the interface is
re-enabled. Also, I can't very easily rmmod, since that would require
me to down both interfaces under script control. This makes me
slightly nervous, since the console is about 7000 miles away from here.
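To watch the counter I mentioned, something like the following could
pull the transmit error count for one interface out of /proc/net/dev.
This is only a sketch: tx_errs is a made-up helper name, and it assumes
the layout where eight receive columns come before the transmit
columns, so check the header line on your kernel first.

```shell
#!/bin/sh
# Print the transmit "errs" counter for one interface, reading
# /proc/net/dev-style text on stdin.
# ASSUMPTION: 8 receive fields follow the interface name, and transmit
# errs is the 3rd transmit field (the 11th number after the colon).
tx_errs() {
    # $1 = interface name, e.g. eth1
    # Turn "eth1:12345" into "eth1 12345" so awk sees clean fields.
    sed 's/:/ /' | awk -v ifc="$1" '$1 == ifc { print $12 }'
}

# Usage (commented out so the sketch has no side effects):
#   tx_errs eth1 < /proc/net/dev
```

Running it before and after an ifdown/ifup cycle shows whether the
count moved at all across the restart.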
If anyone has any suggestions as to what I should try, what additional
information might be helpful, etc., it would be most appreciated. I am
supposed to turn this on live in a week. Considering the private VLAN
(eth1) is the core of the inter-box communication (see
http://www.linuxvirtualserver.org) and NFS mounting, I am pretty
much screwed if this cannot be made to work like things here in the
US.

Thanks in advance, and I apologize for the excessive length, but I
wanted to cover as much as possible in one place.
Thanks again.
Paul Walker
