[Beowulf] Pricing and Trading Networks: Down is Up, Left is Right

Thu Feb 16 07:26:53 PST 2012

http://www.fragmentationneeded.net/2011/12/pricing-and-trading-networks-down-is-up.html

Pricing and Trading Networks: Down is Up, Left is Right

My introduction to enterprise networking was a little backward. I started out
supporting trading floors, backend pricing systems, low-latency algorithmic
trading systems, etc... I got there because I'd been responsible for UNIX
systems producing and consuming multicast data at several large financial
firms.

Inevitably, each firm's network admin folks weren't up to speed on matters of
performance tuning, multicast configuration and QoS, so that's where I
focused my attention. One of these firms offered me a job with the word
"network" in the title, and I was off to the races.

It amazes me how little I knew in those days. I was doing PIM and MSDP
designs before the phrases "link state" and "distance vector" were in my
vocabulary! I had no idea what was populating the unicast routing table of my
switches, but I knew that the table was populated, and I knew what PIM was
going to do with that data.

More incredible is how my ignorance of "normal" ways of doing things (AVVID,
SONA, Cisco Enterprise Architecture, multi-tier designs, etc...) gave me an
advantage over folks who had been properly indoctrinated. My designs worked
well for these applications, but looked crazy to the rest of the network
staff (whose underperforming traditional designs I was replacing).

The trading floor is a weird place, with funny requirements. In this post I'm
going to go over some of the things that make trading floor networking...
Interesting.

Redundant Application Flows

The first thing to know about pricing systems is that you generally have two
copies of any pricing data flowing through the environment at any time.
Ideally, these two sets originate from different head-end systems, get
transit from different wide area service providers, ride different physical
infrastructure into opposite sides of your data center, and terminate on
different NICs in the receiving servers.

If you're getting data directly from an exchange, that data will probably be
arriving as multicast flows. Redundant multicast flows. The same data arrives
at your edge from two different sources, using two different multicast
groups.
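
To make the "two copies" idea concrete, here's a rough sketch of how a receiver might merge redundant A and B feeds. It assumes every message carries a sequence number shared by both feeds; the `(feed, seq, payload)` tuple format and the `arbitrate` function are made up for illustration, not taken from any real feed handler.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical arrival record: (feed name, sequence number, payload).
Arrival = Tuple[str, int, str]


def arbitrate(arrivals: Iterable[Arrival], first_seq: int = 1) -> Iterator[Tuple[int, str]]:
    """Deliver each sequence number exactly once, in order, from whichever
    feed supplies it first. Later copies are treated as duplicates."""
    expected = first_seq
    pending: Dict[int, str] = {}
    for _feed, seq, payload in arrivals:
        if seq < expected or seq in pending:
            continue  # already delivered or already held: redundant copy
        pending[seq] = payload
        # Release everything that is now contiguous with what we delivered.
        while expected in pending:
            yield expected, pending.pop(expected)
            expected += 1


# Feed A loses message 3; feed B fills the hole and nobody needs a repair.
arrivals = [
    ("A", 1, "p1"), ("B", 1, "p1"),
    ("A", 2, "p2"), ("B", 2, "p2"),
    ("A", 4, "p4"), ("B", 3, "p3"),
    ("B", 4, "p4"), ("A", 5, "p5"), ("B", 5, "p5"),
]
delivered = list(arbitrate(arrivals))
```

A loss on one feed is invisible to the application as long as the other feed delivers the same message, which is the whole reason for paying for two copies.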

If you're buying data from a value-add aggregator (Reuters, Bloomberg,
etc...), then it probably arrives via TCP from at least two different
sources. The data may be duplicate copies (redundancy), or be distributed
among the flows with an N+1 load-sharing scheme.

Losing One Packet Is Bad

Most application flows have no problem with packet loss. High performance
trading systems are not in this category.

Think of the state of the pricing data like a spreadsheet. Each row
represents a security -- something that traders buy and sell. The columns
represent attributes of that security: bid price, ask price, daily high and
low, last trade price, last trade exchange, etc...

Our spreadsheet has around 100 columns and 200,000 rows. That's 20 million
cells. Every message that rolls in from a multicast feed updates one of those
cells. You just lost a packet. Which cell is wrong? Easy answer: All of them.
If a trader can't trust his data, he can't trade.

These applications have repair mechanisms, but they're generally slow and/or
clunky. Some of them even involve touch tone. Really:

    The Securities Industry Automation Corporation (SIAC) provides a
    retransmission capability for the output data from host systems.  As part
    of this service, SIAC provides the AutoLink facility to assist vendors
    with requesting retransmissions by submitting requests over a touch-tone
    telephone set

Reconvergence Is Bad

Because we've got two copies of the data coming in, there's no reason to fix
a single failure. If something breaks, you can let it stay broken until the
end of the day.

What's that? You think it's worth fixing things with a dynamic routing
protocol? Okay cool, route around the problem. Just so long as you can
guarantee that "flow A" and "flow B" never traverse the same core router. Why
am I paying for two copies of this data if you're going to push it through a
single device? You just told me that the device is so fragile that you feel
compelled to route around failures!

Don't Cluster the Firewalls

The same reason we don't let routing reconverge applies here. If there are
two pricing firewalls, don't tell them about each other. Run them as
standalone units. Put them in separate rooms, even.  We can afford to lose
half of a redundant feed. We cannot afford to lose both feeds, even for the
few milliseconds required for the standby firewall to take over. Two clusters
(four firewalls) would be okay, just keep the "A" and "B" feeds separate!

Don't team the server NICs

The flow-splitting logic applies all the way down to the servers. If they've
got two NICs available for incoming pricing data, these NICs should be
dedicated per-flow. Even if there are NICs-a-plenty, the teaming schemes are
all bad news because, like flows, application components are also disposable.
It's okay to lose one. Getting one back? That's sometimes worse. Keep
reading...
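
For what "dedicated per-flow" can look like on the receiving host, here's a minimal sketch using the standard sockets API. The group addresses, NIC addresses and port are placeholders I made up. The point is only that each multicast join names the local interface address explicitly, so the "A" group is joined on one NIC and the "B" group on the other, with no teaming driver in between.

```python
import socket
import struct

# Hypothetical values: (multicast group, local address of the dedicated NIC).
FEED_A = ("239.1.1.1", "10.0.1.10")
FEED_B = ("239.2.2.2", "10.0.2.10")
PORT = 5000  # placeholder port


def membership_request(group: str, nic_addr: str) -> bytes:
    """Build the 8-byte ip_mreq value: group address, then the address of
    the local interface the join should be issued on."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(nic_addr))


def join_on_nic(group: str, nic_addr: str, port: int) -> socket.socket:
    """Open a UDP socket and join `group` on the NIC that owns `nic_addr`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, nic_addr))
    return sock


# On a real host with those addresses configured, you would run:
#   sock_a = join_on_nic(*FEED_A, PORT)
#   sock_b = join_on_nic(*FEED_B, PORT)
```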

Recovery Can Kill You

Most of these pricing systems include a mechanism for data receivers to
request retransmission of lost data, but the recovery can be a problem. With
few exceptions, the network applications in use on the trading floor don't do
any sort of flow control. It's like they're trying to hurt you.

Imagine a university lecture where a sleeping student wakes up, asks the
lecturer to repeat the last 30 minutes, and the lecturer complies. That's
kind of how these systems work.

Except that the lecturer complies at wire speed, and the whole lecture hall
full of students is compelled to continue taking notes. Why should every
other receiver be penalized because one system screwed up? I've got trades to
clear!

The following snapshot is from the Cisco CVD for trading systems. It shows
how aggressive these systems can be. A nominal 5Mb/s trading application
regularly hits wire-speed (100Mb/s) in this case.

The graph shows a small network when things are working right. A big trading
backend at a large financial services firm can easily push that green line
into the multi-gigabit range. Make things interesting by breaking stuff and
you'll over-run even your best 10Gb/s switch buffers (6716 cards have 90MB
per port) easily.

Slow Servers Are Good

Lots of networks run with clients deliberately connected at slower speeds
than their server. Maybe you have 10/100 ports in the wiring closet and
gigabit-attached servers. Pricing networks require exactly the opposite. The
lecturer in my analogy isn't just a single lecturer. It's a team of
lecturers. They all go into wire-speed mode when the sleeping student wakes
up.

How will you deliver multiple simultaneous gigabit-ish multicast streams to
your access ports? You can't. I've fixed more than one trading system by
setting server interfaces down to 100Mb/s or even 10Mb/s. Fast clients, slow
servers is where you want to be.

Slowing down the servers can turn N*1Gb/s worth of data into N*100Mb/s --
something we can actually handle.
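
The arithmetic is simple enough to sketch. The server count and link speeds below are illustrative assumptions, not measurements from any real floor:

```python
def worst_case_burst_mbps(num_servers: int, server_link_mbps: int) -> int:
    """Aggregate rate if every server retransmits at wire speed at once."""
    return num_servers * server_link_mbps


def fits_access_port(num_servers: int, server_link_mbps: int,
                     client_link_mbps: int) -> bool:
    """True if the worst-case burst fits down a single client access port."""
    return worst_case_burst_mbps(num_servers, server_link_mbps) <= client_link_mbps


# Hypothetical floor: 8 pricing servers, gigabit-attached clients.
servers = 8
client_port = 1000  # Mb/s

gigabit_servers_ok = fits_access_port(servers, 1000, client_port)  # 8000 Mb/s offered
slow_servers_ok = fits_access_port(servers, 100, client_port)      # 800 Mb/s offered
```

With gigabit servers the worst case is eight times what the access port can carry. Clamp the servers to 100Mb/s and the same storm fits with room to spare.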

Bad Apple Syndrome

The sleeping student example is actually pretty common. It's amazing to see
the impact that can arise from things like:

    a clock update on a workstation

    ripping a CD with iTunes

    briefly closing the lid on a laptop

The trading floor is usually a population of Windows machines with users
sitting behind them. Keeping these things from killing each other is a
daunting task. One bad apple will truly spoil the bunch.

How Fast Is It?

System performance is usually measured in terms of stuff per interval. That's
meaningless on the trading floor. The opening bell at NYSE is like turning on
a fire hose. The only metric that matters is the answer to this question: Did
you spill even one drop of water?

How close were you to the limit? Will you make it through tomorrow's trading
day too?

I read on Twitter that Ben Bernanke got a bad piece of fish for dinner. How
confident are you now? Performance of these systems is binary. You either
survived or you did not. There is no "system is running slow" in this world.

Routing Is Upside Down

While not unique to trading floors, we do lots of multicast here. Multicast
is funny because it relies on routing traffic away from the source, rather
than routing it toward the destination. Getting into and staying in this
mindset can be a challenge. I started out with no idea how routing worked, so
had no problem getting into the multicast mindset :-)
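
The "upside down" part shows up in the reverse path forwarding (RPF) check that multicast routers perform: look up the packet's source (not its destination) in the unicast routing table, and accept the packet only if it arrived on the interface that leads back toward that source. Here's a toy sketch. The routing table and interface names are invented, and a real router does a lot more than this.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical unicast routing table: prefix -> interface toward that prefix.
ROUTES: Dict[str, str] = {
    "0.0.0.0/0": "eth0",
    "10.1.0.0/16": "eth1",
    "10.1.5.0/24": "eth2",
}


def rpf_interface(source: str, routes: Dict[str, str]) -> Optional[str]:
    """Longest-prefix match on the packet's SOURCE address."""
    addr = ipaddress.ip_address(source)
    best = None
    for prefix, ifname in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, ifname)
    return best[1] if best else None


def rpf_check(source: str, arrival_interface: str, routes: Dict[str, str]) -> bool:
    """Accept a multicast packet only if it came in on the interface we
    would use to send unicast traffic back to its source."""
    return rpf_interface(source, routes) == arrival_interface
```

For example, `rpf_check("10.1.5.9", "eth2", ROUTES)` passes, while the same packet showing up on `eth1` fails the check and gets dropped.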

NACK not ACK

Almost every network protocol relies on data receivers ACKnowledging their
receipt of data. But not here. Pricing systems only speak up when something
goes missing.
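
A receiver in a NACK-based scheme can be sketched in a few lines. This assumes messages carry a monotonically increasing sequence number. The function name and the "range of missing numbers" request format are hypothetical, since every feed defines its own.

```python
from typing import Iterable, List, Tuple


def nacks_for(received_seqs: Iterable[int], first_seq: int = 1) -> List[Tuple[int, int]]:
    """Return the (first_missing, last_missing) ranges a receiver would
    request. Nothing is sent at all while data arrives in sequence."""
    expected = first_seq
    requests = []
    for seq in received_seqs:
        if seq > expected:
            # A gap: everything from `expected` up to seq - 1 went missing.
            requests.append((expected, seq - 1))
        if seq >= expected:
            expected = seq + 1
    return requests


quiet = nacks_for([1, 2, 3, 4])          # nothing lost, nothing said
noisy = nacks_for([1, 2, 3, 6, 7, 10])   # two gaps, two requests
```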

QoS Isn't The Answer

QoS might seem like the answer to make sure that we get through the day
smoothly, but it's not. In fact, it can be counterproductive.

QoS is about managed un-fairness... Choosing which packets to drop. But
pricing systems are usually deployed on dedicated systems with dedicated
switches. Every packet is critical, and there's probably more of them than we
can handle. There's nothing we can drop.

Making matters worse, enabling QoS on many switching platforms reduces the
buffers available to our critical pricing flows, because the buffers
necessarily get carved so that they can be allocated to different kinds of
traffic. It's counterintuitive, but 'no mls qos' is sometimes the right
thing to do.

Load Balancing Ain't All It's Cracked Up To Be

By default, CEF doesn't load balance multicast flows. CEF load balancing of
multicast can be enabled and enhanced, but doesn't happen out of the box.

We can get screwed on EtherChannel links too: Sometimes these quirky
applications intermingle unicast data with the multicast stream. Perhaps a
latecomer to the trading floor wants to start watching Cisco's stock price.
Before he can begin, he needs all 100 cells associated with CSCO. This is
sometimes called the "Initial Image." He ignores updates for CSCO until he's
got that starting point loaded up.

CSCO has updated 9000 times today, so the server unicasts the initial image:
"Here are all 100 cells for CSCO as of update #9000: blah blah blah...". Then
the price changes, and the server multicasts update #9001 to all receivers.

If there's a load balanced path (either CEF or an aggregate link) between the
server and client, then our new client could get update 9001 (multicast)
before the initial image (unicast) shows up. The client will discard update
9001 because he's expecting a full record, not an update to a single cell.

Next, the initial image shows up, and the client knows he's got everything
through update #9000. Then update #9002 arrives. Hey, what happened to #9001?

Post-mortem analysis of these kinds of incidents will boil down to the
software folks saying:

    We put the messages on the wire in the correct order. They were delivered
    by the network in the wrong order.
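
The CSCO scenario is easy to reproduce on paper. The sketch below models the client logic as I've described it: ignore updates until an initial image arrives, then expect consecutive update numbers. The event tuples are an invented format, not any vendor's wire protocol.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # ("image" or "update", update number)


def missing_updates(events: List[Event]) -> List[int]:
    """Return the update numbers the client believes it never received."""
    last_seq = None
    gaps: List[int] = []
    for kind, seq in events:
        if kind == "image":
            last_seq = seq          # baseline: everything through `seq`
        elif last_seq is None:
            continue                # no baseline yet, update is discarded
        else:
            if seq > last_seq + 1:
                gaps.extend(range(last_seq + 1, seq))
            last_seq = max(last_seq, seq)
    return gaps


as_sent = [("image", 9000), ("update", 9001), ("update", 9002)]
as_delivered = [("update", 9001), ("image", 9000), ("update", 9002)]

ok = missing_updates(as_sent)
broken = missing_updates(as_delivered)
```

Same three messages, same content. The only difference is that the unicast image and the multicast update took different paths, and update 9001 is now "lost".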

ARP Times Out

NACK-based applications sit quietly until there's a problem. So quietly that
they might forget the hardware address associated with their gateway or with
a neighbor.

No problem, right? ARP will figure it out... Eventually. Because these are
generally UDP-based applications without flow control, the system doesn't
fire off a single packet, then sit and wait like it might when talking TCP.
No, these systems can suddenly kick off a whole bunch of UDP datagrams
destined for a system they haven't talked to in hours.

The lower layers in the IP stack need to hold onto these packets until the
ARP resolution process is complete. But the packets keep rolling down the
stack! The outstanding ARP queue is only 1 packet deep in many
implementations. The queue overflows and data is lost. It's not strictly a
network problem, but don't worry. Your phone will ring.
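
Here's a deliberately crude model of that overflow. It assumes the whole burst leaves the application before the ARP reply comes back, and that the stack can hold only `arp_queue_depth` packets per unresolved neighbor. Real stacks differ in queue depth and in which packets they discard, so treat the numbers as illustration only.

```python
from typing import Tuple


def send_burst(num_datagrams: int, arp_queue_depth: int = 1) -> Tuple[int, int]:
    """Return (held, dropped) for a burst sent while ARP is unresolved."""
    held = min(num_datagrams, arp_queue_depth)
    dropped = num_datagrams - held
    return held, dropped


# A 50-datagram burst toward a neighbor whose ARP entry has aged out.
held, dropped = send_burst(50)
```

One packet survives to be sent when the ARP reply arrives. The other 49 are gone before they ever touched the network.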

Losing Data Causes You to Lose Data

There's a nasty failure mode underlying the NACK-based scheme. Lost data will
be retransmitted. If you couldn't handle the data flow the first time around,
why expect to handle wire speed retransmission of that data on top of the
data that's coming in the next instant?

If the data loss was caused by a Bad Apple receiver, then all his peers
suffer the consequences. You may have many bad apples in a moment. One Bad
Apple will spoil the bunch.

If the data loss was caused by an overloaded network component, then you're
rewarded by compounding increases in packet rate. The exchanges don't stop
trading, and the data sources have a large queue of data to re-send.

TCP applications slow down in the face of congestion. Pricing applications
speed up.
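
A toy model shows the feedback loop. Assume a component can pass `capacity` units per tick, anything beyond that is lost, and every lost unit is re-sent in the next tick on top of the fresh data. All the numbers are invented, and real systems are messier, but the shape is the point.

```python
from typing import List


def offered_load(fresh_per_tick: List[int], capacity: int) -> List[int]:
    """Offered load per tick when losses are retransmitted one tick later."""
    retransmit = 0
    history = []
    for fresh in fresh_per_tick:
        offered = fresh + retransmit
        history.append(offered)
        # Whatever exceeded capacity is lost and comes back next tick.
        retransmit = max(0, offered - capacity)
    return history


# Headroom of 10%: a single burst takes five more ticks to drain.
one_burst = offered_load([90, 150, 90, 90, 90, 90, 90, 90], capacity=100)

# 5% over capacity: the offered load climbs every single tick.
overloaded = offered_load([105, 105, 105, 105], capacity=100)
```

With headroom, one burst keeps the component over its limit for several ticks after the burst itself is over. Without headroom, the load never stops climbing.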

Packet Decodes Aren't Available

Some of the wire formats you'll be dealing with are closed-source secrets.
Others are published standards for which no Wireshark decodes are publicly
available. Either way, you're pretty much on your own when it comes to
analysis.

Updates

Responding to Will's question about data sources: The streams come from the
various exchanges (NASDAQ, NYSE, FTSE, etc...). Because each of these
exchanges uses its own data format, there are usually some layers of
processing required to get them into a common format for application
consumption. This processing can happen at a value-add data distributor
(Reuters, Bloomberg, Activ), or it can be done in-house by the end user.
Local processing has the advantage of lower latency because you don't have to
have the data shipped from the exchange to a middleman before you see it.

Other streams come from application components within the company. There are
usually some layers of processing (between 2 and 12) between a pricing update
first hitting your equipment, and when that update is consumed by a trader.
The processing can include format changes, addition of custom fields, delay
engines (delayed data can be given away for free), vendor-switch systems (I
don't trust data vendor "A", switch me to "B"), etc...

Most of those layers are going to be multicast, and they're going to be the
really dangerous ones, because the sources can clobber you with LAN speeds,
rather than WAN speeds.

As far as getting the data goes, you can move your servers into the
exchange's facility for low-latency access (some exchanges actually provision
the same length of fiber to each colocated customer, so that nobody can claim
a latency disadvantage), you can provision your own point-to-point circuit
for data access, you can buy a fat local loop from a financial network
provider like BT/Radianz (probably MPLS on the back end so that one local
loop can get you to all your pricing and clearing partners), or you can buy
the data from a value-add aggregator like Reuters or Bloomberg.

Responding to Will's question about SSM:  I've never seen an SSM pricing
component. They may be out there, but they might not be a super good fit.
Here's why: Everything in these setups is redundant, all the way down to
software components. It's redundant in ways we're not used to seeing in
enterprises. No load-balancer required here. The software components
collaborate and share workload dynamically. If one ticker plant fails, his
partner knows what update was successfully transmitted by the dead peer, and
takes over from that point. Consuming systems don't know who the servers are,
and don't care. A server could be replaced at any moment.

In fact, it's not just downstream pricing data that's multicast. Many of
these systems use a model where the clients don't know who the data sources
are. Instead of sending requests to a server, they multicast their requests
for data, and the servers multicast the replies back. Instead of:

    <handshake> hello server, nice to meet you. I'd like such-and-such.

it's actually:

    hello? servers? I'd like such-and-such! I'm ready, so go ahead and send
    it whenever...

Not knowing who your server is kind of runs counter to the SSM ideal. It
could be done with a pool of servers, I've just never seen it.

The exchanges are particularly slow-moving when it comes to changing things.
Modern exchange feeds, particularly ones like the "touch tone" example I
cited, are literally ticker-tape punch signals wrapped up in an IP multicast
header.

The old school scheme was to have a ticker tape machine hooked to a "line"
from the exchange.  Maybe you'd have two of them (A and B again). There would
be a third one for retransmit. Ticker machine run out of paper? Call the
exchange, and here's more-or-less what happens:

    Cut the chunk of paper containing the updates you missed out of their
    spool of tape.  Scissors are involved here.

    Grab a bit of header tape that says: "this is retransmit data for XYZ
    Bank".

    Tape these two pieces of paper together, and feed them through a reader
    that's attached to the "retransmit line"

    Every bank in New York will get the retransmits, but they'll know to
    ignore them.

    XYZ Bank clips the retransmit data out of the retransmit ticker machine,
    and pastes it into place on the end where the machine ran out of paper.

The terms "tick", "line", "retransmit", etc... all still apply with
modern IP based systems. I've read the developer guides for these systems (to
write Wireshark decodes), and it's like a trip back in time. Some of these
systems are still so closely coupled to the paper-punch system that you get
chads all over the floor and paper cuts all over your hands just from reading
the API guide :-)