[vortex] Re: [vortex-bug] DPD FSH for Cyclone/Tornado
Donald Becker
becker@scyld.com
Tue, 16 May 2000 15:30:33 -0400 (EDT)
[[ Note: This discussion moved to linux-vortex from linux-vortex-bug. ]]
On Tue, 16 May 2000, Bogdan Costescu wrote:
> > vp->tx_ring[entry].status = cpu_to_le32(skb->len);
>
> You got this wrong:---------------------------------------^^^^^^
> I suggested:
> vp->tx_ring[entry].status = cpu_to_le32(TxIntrUploaded);
Doh! I did M-C-k and didn't read the result. Not enough sleep.
> > They look similar, but the calculation of
> > if (vp->drv_flags & IS_BOOMERANG)
> > adds instructions to shuffle values on the register-poor x86.
>
> I don't quite understand this: the test is done in both cases, you only
> make the code on each branch a little longer. This does affect the cache
> impact.
The subtle difference is that on some processors, with a good compiler,
foo[entry].ele0 = 12;
foo[entry].ele1 = 23;
foo[entry].ele2 = 45;
foo[entry].ele3 = 56;
fills a single write buffer line, which is then written to memory in a
single transaction.
Doing
foo[entry].ele0 = 12;
foo[entry].ele1 = 23;
foo[entry].ele2 = 45;
if (read_uncached_something & 0x8888)
foo[entry].ele3 = 56;
Might trigger a read-modify-write of an incomplete memory or cache line,
followed by another R-M-W.
> Agreed. Another ideea would be to have separate routines for Boomerang and
> Cyclone/Tornado where no "if" is required. The driver grows a bit, but the
> cache impact is reduced.
Yes, this is critical path code, and it's relatively small.
> I assume that you mean the writting in data cache and not the cacheing of
> the code that writes. But there is still something that I don't
> understand: why are you so picky now about this when there are other
> sub-optimal things in the driver? Is a cache fault so much worse that I/O
> operations that we need anyway when dealing with H/W devices?
No. PCI I/O operations are easily the biggest impact. The PCI bus will
increasingly look like Ethernet, where you painfully wait as a long burst
passes. Even no-conflict reads take the time of 100s of instructions.
> I'm thinking here of the tbusy flag which has no special use AFAICS.
On many kernels versions it does.
The Tx-timeout code, based on calling the Tx routine with dev->tbusy set,
was originally a recovery hack based on the *bug* of potentially re-entering
the Tx routine on a timer-based retransmission. (It worked out very well,
because you were usually only retransmitting when a packet didn't get
through!)
Having the dev->tbusy lock later allowed better queue-level performance
because it only needed to do an approximate check of dev->tbusy, rather than
locking a longer code path itself.
> ...
> sync). If this flag is eliminated, the code becomes shorter in several
> places.
Yes, *every* driver is simplified.
More importantly it moves all queue and SMP locking to the queue layer.
Ideally, a driver for a clean device design won't need to know about SMP at
all, certainly not in the critical path.
Donald Becker becker@scyld.com
Scyld Computing Corporation
410 Severn Ave. Suite 210
Annapolis MD 21403