[eepro100] Re: Eepro100 1.36 on Alpha Linux 2.4.2 transmit timeout

Andrey Savochkin saw@saw.sw.com.sg
Wed, 25 Apr 2001 06:42:18 -0700


Hello,

On Mon, Apr 23, 2001 at 02:42:26PM +0200, Cabaniols, Sebastien wrote:
> I am working on a cluster of ES40 Alpha servers (4 cpus) 
> with DE600 boards under Alpha Linux 2.4.2smp. The machines
> have 8 Gigabytes of RAM (this has been an issue with
> the myrinet boards)
[snip]
> 
> I get 
> 
> NETDEV WATCHDOG | eth0 : transmit time out.
> status 0090 0c00 at XXXXX/YYYYY command 000ca000
> wait_for_cmd_done timeout.

On Wed, Apr 25, 2001 at 02:38:58PM +0200, Cabaniols, Sebastien wrote:
> My hardware configuration is:
> 
> 	AlphaServer ES40, 4 cpus, 8 Gigas of RAM
[snip]
> As long as I do not stress the network too much, everything is fine. I can
> transfer small files, but when I do big transfers I see in /var/log/messages:
> 
> 	NETDEV WATCHDOG | eth0: transmit timeout
> 				 status 0090 0c00 at xxxxxx/xxxxxx command 000ca000
> 				 wait_cmd_done timeout.
> 
> 
> If I insist and launch another transfer, the system freezes; I lose the
> console and the network and I must do a hard reboot.

The timeouts are likely the result of a race condition in the status word
update.

Try the patch quoted below, changing its first condition to just
#if defined(__alpha__)
as Jay proposes at the end of his message.

As for the complete freeze of the system, I have no idea why it
happens.

	Andrey

Date: Tue, 20 Feb 2001 17:26:37 -0500
From: Jay Estabrook <Jay.Estabrook@compaq.com>
To: Matt Wilson <msw@redhat.com>
Cc: Andrey Savochkin <saw@saw.sw.com.sg>, Richard Henderson <rth@redhat.com>,
  Alan Cox <alan@redhat.com>, "Goshdigian, John" <John.Goshdigian@compaq.com>,
  Pat Rago <prago@redhat.com>, George France <budan@excite.com>,
  George France <george.france2@compaq.com>, Preston Brown <pbrown@redhat.com>
Subject: Re: PATCH: eepro100 hangs on Alpha - atomic bit ops
Message-ID: <20010220172637.B2182@linux04.mro.cpqcorp.net>
References: <20010219152247.A22256@saw.sw.com.sg> <20010219194603.A31644@devserv.devel.redhat.com> <20010219164949.A26051@redhat.com> <20010219171117.A23867@saw.sw.com.sg> <20010219171550.A26061@redhat.com> <20010219172406.A23932@saw.sw.com.sg> <20010219173437.A26085@redhat.com> <20010219174114.A24055@saw.sw.com.sg> <20010219174419.B26085@redhat.com> <20010220130313.X9499@devserv.devel.redhat.com>

On Tue, Feb 20, 2001 at 01:03:14PM -0500, Matt Wilson wrote:
>
> OK, new version of the patch attached.

> --- linux/drivers/net/eepro100.c.alpha	Tue Feb 20 12:54:35 2001
> +++ linux/drivers/net/eepro100.c	Tue Feb 20 12:57:33 2001
> @@ -341,14 +341,17 @@
>  /* Clear CmdSuspend (1<<30) avoiding interference with the card access to the
>     status bits.  Previous driver versions used separate 16 bit fields for
>     commands and statuses.  --SAW
> -   FIXME: it may not work on non-IA32 architectures.
>   */
> -#if defined(__LITTLE_ENDIAN)
> -#define clear_suspend(cmd)  ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x4000
> -#elif defined(__BIG_ENDIAN)
> -#define clear_suspend(cmd)  ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x0040
> +#if defined(__alpha__) && !defined (__alpha_bwx__)
> +# define clear_suspend(cmd)  clear_bit(30, &(cmd)->cmd_status);
>  #else
> -#error Unsupported byteorder
> +# if defined(__LITTLE_ENDIAN)
> +#  define clear_suspend(cmd)  ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x4000
> +# elif defined(__BIG_ENDIAN)
> +#  define clear_suspend(cmd)  ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x0040
> +# else
> +#  error Unsupported byteorder
> +# endif
>  #endif

I do NOT believe the above will completely solve the problem.

First, I assume that the cmd->cmd_status[] array is in HOST memory, i.e.
not PCI memory on the ethernet card. If it *is* PCI memory, AFAIK
there's no way to do an atomic update on Alpha. End of discussion.

Second, BWX instructions won't buy you atomicity WRT the above operation.
On *all* Alphas, you MUST use the clear_bit() code.

Third, you MUST guarantee that the clear_bit() operand is aligned
correctly for the operation (I believe it must be a 32-bit quantity,
and thus on a 32-bit, i.e. 4-byte, boundary).  If it's not, the
load-locked and store-conditional instructions that are part of the
clear_bit() code will NOT operate correctly.
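
For illustration only, here is a user-space sketch of the second and
third points: clear bit 30 with one atomic read-modify-write on the
whole 32-bit word, and check that the word sits on a 4-byte boundary
first. It is not the driver code. It uses C11 atomic_fetch_and() as a
stand-in for the kernel's clear_bit(), and struct demo_desc,
is_aligned32() and demo_clear_suspend() are hypothetical names made up
for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* CmdSuspend is (1<<30), per the comment in the quoted patch. */
#define CMD_SUSPEND_BIT 30u

/* Hypothetical stand-in for a descriptor; only the cmd_status name
   mirrors the driver.  alignas(4) forces the 4-byte alignment. */
struct demo_desc {
    alignas(4) _Atomic uint32_t cmd_status;
};

/* Returns nonzero if p lies on a 4-byte boundary. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Clear the suspend bit with a single atomic AND on the full 32-bit
   word, instead of a plain 16-bit "&=" on half of it. */
static void demo_clear_suspend(struct demo_desc *d)
{
    assert(is_aligned32(&d->cmd_status));
    atomic_fetch_and(&d->cmd_status, ~(UINT32_C(1) << CMD_SUSPEND_BIT));
}
```

Calling demo_clear_suspend() on a descriptor whose cmd_status has bit 30
set leaves every other bit of the word untouched. Whether the compiler
turns this into a load-locked/store-conditional sequence depends on the
target; the sketch only shows the shape of the operation.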

Bottom line: this

> +#if defined(__alpha__) && !defined (__alpha_bwx__)

should be just

> +#if defined(__alpha__)

--Jay++

-----------------------------------------------------------------------------
Jay A Estabrook                            Alpha Engineering - LINUX Project
Compaq Computer Corp. - MRO1-2/K20         (508) 467-2080
200 Forest Street, Marlboro MA 01752       Jay.Estabrook@compaq.com
-----------------------------------------------------------------------------