[Beowulf] Help with inconsistent network performance

Wed Dec 19 00:26:46 PST 2007

Greg Lindahl wrote:
> On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:
> 
>> No, it just means the NIC supports it.
> 
> Well, then how about ethtool -S? That looks like an actual count of
> flow control events, so rx flow control events means the switch
> must support it in some fashion.

If this counter is nonzero, then you can say the switch does support RX 
flow control, which is the most important part. However, the NIC driver may 
not report these events to ethtool, and in any case you need to generate 
some contention in the switch to trigger them. A simple test is to run a 
small MPI code where several senders stream to a single receiver. If you 
see an aggregate bandwidth equal to the receiver link bandwidth, then flow 
control works. If you also see that all senders get the same bandwidth, 
then the switch is fair on top of that.
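
To make the two checks concrete, here is a minimal Python sketch that interprets the numbers from such an N->1 run. It does not run the MPI test itself. The function name, the Mb/s units and the 10% tolerance are assumptions for illustration only; the tolerance is there because a real run will land a bit under the nominal link rate.

```python
def assess_incast(sender_mbps, link_mbps, tolerance=0.10):
    """Interpret an N->1 streaming test (hypothetical helper).

    sender_mbps: bandwidth measured by each sender, in Mb/s.
    link_mbps:   usable bandwidth of the receiver link, in Mb/s.
    tolerance:   assumed relative slack; 10% is an arbitrary choice.
    """
    if not sender_mbps:
        raise ValueError("need at least one sender measurement")
    total = sum(sender_mbps)
    # Flow control works if the senders together fill the receiver link.
    flow_control_ok = total >= (1.0 - tolerance) * link_mbps
    # The switch is fair if every sender is close to an equal share.
    share = total / len(sender_mbps)
    fair = all(abs(bw - share) <= tolerance * share for bw in sender_mbps)
    return {
        "aggregate_mbps": total,
        "flow_control_ok": flow_control_ok,
        "fair": fair,
    }


if __name__ == "__main__":
    # Made-up measurements for four senders on a 1 Gb/s receiver link.
    print(assess_incast([235, 240, 230, 236], link_mbps=1000))
```

An aggregate well below the link rate suggests packets are being dropped and retransmitted. A full aggregate with very uneven per-sender numbers suggests flow control works but the switch is not fair.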

> Well, we know it can be done perfectly, it's done in InfiniBand
> switches, and that other 10 gig non-ethernet switch, what's it called?
> Oh yeah, Myrinet. They do it, too.

In Ethernet, the sender has to finish sending the current packet before 
stopping, so your switch buffers should be able to store a full frame 
in addition to the wire delay. In Myrinet (and I presume in IB), the 
hardware flow control can stop a sender in the middle of a packet, so 
you only have to buffer the wire delay. It's 4 KB per port versus 12 
to 16 KB per port. That is not trivial, and some corners may be cut to 
save space/money in the switch chips.
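
The arithmetic behind those per-port figures can be sketched in a few lines of Python. The 4 KB wire-delay figure is the one quoted above. The frame sizes are assumptions: 1518 bytes for a standard maximum Ethernet frame, and 9018 bytes for a jumbo frame (9000-byte MTU plus header and FCS), which is a common but non-standard size.

```python
WIRE_DELAY_BYTES = 4 * 1024   # per-port figure quoted in the text
STANDARD_FRAME_BYTES = 1518   # assumed: standard maximum Ethernet frame
JUMBO_FRAME_BYTES = 9018      # assumed: 9000-byte MTU + 14 header + 4 FCS


def cut_through_headroom(wire_delay_bytes):
    """Sender can be stopped mid-packet: only the wire delay is buffered."""
    return wire_delay_bytes


def ethernet_headroom(wire_delay_bytes, max_frame_bytes):
    """Sender finishes the current frame first: one full frame is added."""
    return wire_delay_bytes + max_frame_bytes


if __name__ == "__main__":
    print("mid-packet stop:", cut_through_headroom(WIRE_DELAY_BYTES), "bytes")
    for name, frame in [("standard", STANDARD_FRAME_BYTES),
                        ("jumbo", JUMBO_FRAME_BYTES)]:
        need = ethernet_headroom(WIRE_DELAY_BYTES, frame)
        print(f"Ethernet, {name} frames: {need} bytes ({need / 1024:.1f} KB)")
```

Under these assumptions the jumbo-frame case comes to about 12.8 KB per port, which falls in the 12 to 16 KB range above, so that range presumably has jumbo frames in mind.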

>> Flow-control is not for everyone, and that's why it is often turned off 
>> by default. When a sender is paused, it will stop sending anything, 
>> including packets for different destinations. Dropping packets is 
>> expensive to recover but it keeps things moving.
> 
> Can Myrinet even disable flow control? Odd that Ethernet is any
> different; dropping any packets is an utter disaster for TCP.

I think it's technically possible to disable flow control in the switch 
crossbars in Myrinet, but you would not want to. The NICs can change 
routes quickly when they sense contention on a specific path (Quadrics 
does the same thing, others can't). That helps a lot for internal hot 
spots that are frequent in HPC, but it does nothing against the N->1 
communication pattern of death. As Mark pointed out, the best way around 
it is to not have it in the first place.

Ethernet switches are often used in more hostile environments where you 
cannot prevent such N->1 traffic. I could flood a particular machine on 
a campus from a couple of hosts to produce contention. That would 
saturate some internal links in the switch, which would propagate the 
contention to other ports, more links would get blocked, etc. If you can 
sustain the contention for a few seconds on a busy switch, then you can 
block the whole thing: complete meltdown.
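
The way a single hot spot spreads under flow control can be illustrated with a toy model in Python. This is only a sketch: the topology, the port names and the assumption that a blocked element pauses everything feeding it (worst case, sustained contention) are all made up for illustration and do not describe any particular switch.

```python
from collections import deque


def pause_spread(feeders, hot_port):
    """Return {element: step at which it becomes blocked}.

    feeders maps each port or internal link to the elements that send
    traffic into it. Toy rule: once an element is blocked, everything
    feeding it fills up and is blocked one step later.
    """
    blocked_at = {hot_port: 0}
    queue = deque([hot_port])
    while queue:
        node = queue.popleft()
        for upstream in feeders.get(node, []):
            if upstream not in blocked_at:
                blocked_at[upstream] = blocked_at[node] + 1
                queue.append(upstream)
    return blocked_at


if __name__ == "__main__":
    # Hypothetical switch: the victim's port is fed by one internal link,
    # which is fed by two edge ports and an uplink, and so on.
    feeders = {
        "victim": ["xbar1"],
        "xbar1": ["portA", "portB", "uplink"],
        "uplink": ["portC", "portD"],
    }
    for element, step in sorted(pause_spread(feeders, "victim").items(),
                                key=lambda kv: kv[1]):
        print(step, element)
```

In this model, ports that never send to the victim still end up blocked, simply because they share an internal link with traffic that does.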

That's why high-end switches/routers are super expensive: they are way 
over-dimensioned inside to be able to handle contention. That's also 
why the FCoE folks are pushing for per-priority flow control in 
Ethernet, so that untrusted/misbehaving traffic can be dropped without 
affecting trusted/important FCoE traffic that should not be dropped. And 
that's why switch flow control is turned off by default most of the time.
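
The per-priority idea can be sketched as a toy decision function in Python. This is not the actual proposal's mechanism: the function name, the threshold values and the choice of priority 3 for the lossless class are assumptions for illustration.

```python
def on_frame_arrival(priority, depths, xoff, limit, lossless):
    """Toy per-priority admission decision for one arriving frame.

    depths:   {priority: frames already queued for that priority}
    xoff:     assumed depth at which a lossless class pauses its sender
    limit:    assumed depth at which a lossy class starts dropping
    lossless: set of priorities that must not be dropped
    """
    depth = depths.get(priority, 0)
    if priority in lossless:
        # Keep the frame, and past the threshold pause only this
        # priority upstream; other priorities keep flowing.
        return "enqueue+pause" if depth >= xoff else "enqueue"
    # Lossy traffic is never paused; it is dropped when its queue is full.
    return "drop" if depth >= limit else "enqueue"


if __name__ == "__main__":
    depths = {0: 8, 3: 6}  # made-up queue depths
    lossless = {3}         # assumed: priority 3 carries the FCoE traffic
    for prio in (0, 3):
        print(prio, on_frame_arrival(prio, depths, xoff=6, limit=8,
                                     lossless=lossless))
```

The point of the design is that a pause applies to one class only, so a flood of best-effort traffic gets dropped instead of stalling the storage traffic that shares the link.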

Patrick