[tulip] specialised application - transmit behaviour I can't explain

Neil Gall neil_gall@agilent.com
Fri, 22 Sep 2000 12:40:53 +0100


Strictly speaking this is off-topic as I'm developing my own tulip driver
for a specialised transmit-only application (it's part of a network
simulation tool for test purposes) but since a few people here know the
tulip hardware and also since what I'm doing may be of general interest I
thought it might be worthwhile asking. My driver code is also somewhat
influenced by Donald Becker's driver, so there may be some common ground
there too. This explanation is quite long so either bear with me or hit
delete now!

My application is this: I need to transmit pre-determined traffic patterns
in close synchronisation on multiple ethernet ports. Each port is the only
transmitter on the segment it is connected to - the eventual purpose is to
plug directly with crossover cables into the system under test, which is a
network monitoring product. Currently I'm using a single PC running Linux
2.2.15-I1 with a D-Link quad-channel server card. This needs to scale up in
size and performance, however, so I will end up with multiple PCs, each
with multiple ethernet ports (up to 16, maybe 32, synchronised links
eventually). I also have a digital I/O and timer card which is generating a
periodic interrupt (at 1kHz currently) - to scale the system up I will
physically send this hardware-generated heartbeat from one master PC to all
the others. The interrupts on each PC should then be closely synchronised.
Close enough for my purposes anyway.

I have implemented a character device driver which controls the digital I/O
card and all Tulip 21143s found in the PC. Frames are queued by write()
calls to this driver. During the heartbeat interrupt I then take the frames
designated for this "timeslot" and set up Tulip transmit descriptors. As
each tx descriptor is set up I do a transmit poll to get the frames out as
quickly as possible. On the last frame I set the interrupt flag in the
transmit descriptor. The tulip interrupt handler then returns my frame
buffers to a pool, not unlike the standard driver calling dev_kfree_skb().
This does mean that two interrupt handlers access the same data structures,
so I have tried various mechanisms for mutual exclusion and settled on a
fairly simple and predictable synchronous style of operation: the heartbeat
interrupt locks a spinlock, does its work, then gets out. The tulip
interrupt takes the same spinlock, does its work and returns, hopefully
with time to finish before the next heartbeat.
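
The division of work between the two interrupts described above can be
sketched as a small Python model (not kernel code). The descriptor bit values
are the ones the Linux tulip driver uses for 21x4x transmit descriptors;
the ring size, the ring-mode layout, the function names and the use of
threading.Lock as a stand-in for the kernel spinlock are assumptions made
purely for illustration.

```python
import threading

DESC_OWNED     = 0x80000000  # TDES0: descriptor is owned by the chip
TX_INT_ON_DONE = 0x80000000  # TDES1: interrupt when this frame completes
TX_LAST_SEG    = 0x40000000  # TDES1: buffer holds the last segment
TX_FIRST_SEG   = 0x20000000  # TDES1: buffer holds the first segment
TX_RING_WRAP   = 0x02000000  # TDES1: last descriptor in the ring
RING_SIZE = 16               # hypothetical ring size

lock = threading.Lock()      # stand-in for the shared spinlock


def new_ring():
    """Build an empty descriptor ring (all descriptors owned by the host)."""
    return [{"status": 0, "control": 0, "buffer": None}
            for _ in range(RING_SIZE)]


def fill_timeslot(ring, head, frames):
    """Model of the heartbeat handler: one descriptor per frame.

    Only the last frame of the timeslot asks for a Tx-done interrupt.
    Returns the new head index.
    """
    with lock:
        for i, frame in enumerate(frames):
            slot = (head + i) % RING_SIZE
            if ring[slot]["status"] & DESC_OWNED:
                raise RuntimeError("ring full at descriptor %d" % slot)
            # Buffer 1 size lives in the low 11 bits of TDES1.
            control = TX_FIRST_SEG | TX_LAST_SEG | (len(frame) & 0x7FF)
            if i == len(frames) - 1:
                control |= TX_INT_ON_DONE
            if slot == RING_SIZE - 1:
                control |= TX_RING_WRAP
            ring[slot] = {"status": DESC_OWNED, "control": control,
                          "buffer": frame}
            # A real driver would issue the transmit poll (write CSR1) here.
        return (head + len(frames)) % RING_SIZE


def reclaim(ring, tail, head, pool):
    """Model of the tulip interrupt handler: return sent buffers to the pool.

    Stops at the first descriptor the chip still owns. Returns the new tail.
    """
    with lock:
        while tail != head and not (ring[tail]["status"] & DESC_OWNED):
            pool.append(ring[tail]["buffer"])
            ring[tail]["buffer"] = None
            tail = (tail + 1) % RING_SIZE
        return tail
```

In this model the chip "completing" a frame is simulated by clearing
DESC_OWNED in the descriptor's status word before calling reclaim().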

Okay, that's the context. Here's my problem: I basically have this working
and on a 733MHz Pentium III with 128MB of RAM I can load a 100baseTX link at
up to 50 or 60% load (or four links at 10-15% each) with every frame correctly falling in
its millisecond timeslot. (I capture the traffic on an Agilent Internet
Advisor, before you ask). I'm currently testing the driver at a constant
data rate (n fixed-length frames per tick) and I'm seeing strange behaviour
at the very start of operation.
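
As a rough cross-check of these load figures, the sketch below converts a
link load into frames per 1 ms tick. It assumes that the 105-byte frame size
mentioned later in this message includes the FCS, and that "load" is counted
with the 8-byte preamble/SFD and the 12-byte minimum inter-frame gap
included. The function names are made up for illustration.

```python
LINK_BPS = 100_000_000     # 100 Mbit/s link
PREAMBLE_SFD_BYTES = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum idle time, expressed in byte times


def slot_bits(frame_bytes):
    """Bit times one back-to-back frame occupies on the wire."""
    return (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8


def frames_per_tick(load_percent, frame_bytes, tick_us=1000):
    """Whole frames per tick needed to reach the given link load."""
    tick_bits = tick_us * LINK_BPS // 1_000_000
    return load_percent * tick_bits // (100 * slot_bits(frame_bytes))


print(slot_bits(105))            # bit times per 105-byte frame
print(frames_per_tick(50, 105))  # frames per 1 ms tick at 50% load
print(frames_per_tick(60, 105))  # frames per 1 ms tick at 60% load
```

Under these assumptions a 105-byte frame occupies 1000 bit times, i.e. 10
microseconds at 100 Mbit/s, so 50-60% load corresponds to 50-60 such frames
in each millisecond tick.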

When the system has been running for a while (a few seconds is enough), my
group of 105-byte frames in each tick appears regularly spaced, 10
microseconds apart. Occasionally (maybe 5-10% of the time) I see a gap of
around 30 or 40 microseconds, which I've yet to explain but it's not
worrying me too much. However, during the first few ticks (i.e.
milliseconds, or a few tens of frames) of operation, these 30-40 microsecond
gaps are much more frequent and sometimes larger, meaning I often "run out
of time" during the heartbeat interrupt (I monitor the clock counters on the
digital I/O board and abandon the tick if it's getting tight). The result is
that a few frames are pushed back into the next timeslot, but after a few
ticks everything speeds up, the system catches up with itself and runs
perfectly normally for hours afterwards.
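
The push-back-and-catch-up pattern described above can be pictured with a
simple carry-over model: frames that do not fit in one tick are added to the
next tick's work. This is only a sketch of the symptom, not an explanation of
its cause. The function name and the per-tick capacities in the example are
hypothetical, not measurements.

```python
def simulate_backlog(frames_per_tick, capacity_per_tick):
    """Return the number of frames left over after each tick.

    capacity_per_tick[i] is how many frames the heartbeat handler manages
    to queue in tick i before it abandons the tick (hypothetical values).
    """
    backlog = 0
    history = []
    for capacity in capacity_per_tick:
        pending = backlog + frames_per_tick  # carried over plus new frames
        sent = min(pending, capacity)
        backlog = pending - sent
        history.append(backlog)
    return history


# Slow first ticks, then spare capacity once things speed up.
print(simulate_backlog(50, [40, 45, 55, 60, 60, 60]))
```

With these made-up capacities the backlog grows over the first two ticks and
drains to zero by the fourth, after which every frame again lands in its own
timeslot.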

Can anyone explain this behaviour? I guess it's possibly related to the
occasional longer gaps, so any explanation of those would also be useful.

For reference:
   - the Rx process is not running
   - autosensing and autonegotiation are turned off
   - tx jabber timeout is turned off
   - the transmitter is in store-and-forward mode
   - all frames are complete in one tx descriptor, just like the
     standard driver
   - I do gather statistics and as expected am seeing no collisions,
     deferred frames or link errors. These are not recorded by the
     Internet Advisor either, so I'm confident there are none.

Sorry that this has been long-winded. Any help at all will be appreciated.
Even pointers to where I could look would be useful as I'm at a bit of a
loss to explain what's going on. Thanks.

--
Neil Gall   neil_gall@agilent.com   +44 131 331 7112
Agilent Technologies UK Ltd.    fax +44 131 331 7423
Telecom Systems Division     mobile +44 771 518 2371
Scotstoun Ave., South Queensferry EH30 9TG, Scotland