[Beowulf] Performance tuning for Jumbo Frames
patrick at myri.com
Sat Dec 12 08:40:49 PST 2009
Rahul Nabar wrote:
> I have seen a considerable performance boost for my codes by using
> Jumbo Frames. But are there any systematic tools or strategies to
> select the optimum MTU size?
There is no optimal MTU size. The MTU is the maximum payload you can fit
in one packet, so there is no drawback to a bigger MTU. Actually, there
is one with wormhole switching, where a longer packet holds its path
through the switch for longer, but switch contention is an issue happily
ignored by most HPC users.
> external world required of the interfaces) Have you guys found
> performance to be MTU sensitive?
A large MTU means fewer packets for the same amount of data transferred.
In all stack processing, there is a per-packet overhead (decoding the
header, integrity check, sequence number, etc.) and a per-byte overhead
(copy). A large MTU reduces the total per-packet overhead because there
are fewer packets to process.
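To put rough numbers on this, here is a minimal sketch of the per-packet
versus per-byte cost model. The 40-byte header figure assumes IPv4 plus
TCP without options, and the per_packet_us and per_byte_us values are
made-up placeholders, not measurements; only the shape of the result
matters.

```python
def packets_needed(payload_bytes, mtu, header_bytes=40):
    """Packets required to carry payload_bytes at a given MTU.

    header_bytes=40 assumes IPv4 (20) + TCP (20) with no options.
    """
    per_packet_payload = mtu - header_bytes
    if per_packet_payload <= 0:
        raise ValueError("MTU must be larger than the header size")
    # Integer ceiling division: a partial last packet still counts.
    return -(-payload_bytes // per_packet_payload)


def stack_cost_us(payload_bytes, mtu, per_packet_us=1.0, per_byte_us=0.0005):
    """Illustrative host stack cost in microseconds.

    per_packet_us and per_byte_us are hypothetical constants.
    """
    n = packets_needed(payload_bytes, mtu)
    return n * per_packet_us + payload_bytes * per_byte_us


if __name__ == "__main__":
    one_mib = 1024 * 1024
    for mtu in (1500, 9000):
        print(mtu, packets_needed(one_mib, mtu), stack_cost_us(one_mib, mtu))
```

The per-byte term is the same at both MTUs; only the per-packet term
shrinks, which is why the gain levels off once the copy cost dominates.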
Most 10GE NICs have no problem reaching line rate at 1500 bytes (the
standard Ethernet MTU); the problem is the host OS stack (mainly TCP),
where the per-packet overhead is significant. One trick that all 10GE
NICs worth their salt are doing these days is to fake a large MTU at the
OS level, while keeping the wire MTU at 1500 bytes (for compatibility).
This is called TSO (TCP Segmentation Offload) on the send side and LRO
(Large Receive Offload) on the receive side. The OS stack uses a virtual
MTU of 64K and the NIC does segmentation/reassembly in hardware, sort of.
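As a toy illustration of the split, the sketch below cuts one 64K buffer
into wire-sized pieces and joins them back. It only models the payload
side: a real NIC also replicates and fixes up the headers, sequence
numbers, and checksums for each segment. The 1460-byte segment size
assumes a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers, and the
function names are hypothetical.

```python
def segment(buffer, mss=1460):
    """Split one large send buffer into wire-sized payload chunks (TSO-like)."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


def reassemble(chunks):
    """Merge in-order payload chunks into one large buffer (LRO-like)."""
    return b"".join(chunks)


if __name__ == "__main__":
    big = bytes(64 * 1024)          # what the OS stack hands down once
    chunks = segment(big)           # what goes on the wire
    print(len(chunks), len(chunks[-1]))
    assert reassemble(chunks) == big
```

The point is that the OS pays the per-packet cost once for the 64K
buffer, while the wire still carries standard-sized frames.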
> Also, are there any switch side parameters that can affect the
> performance of HPC codes? Specifically I was trying to run VASP which
> is known to be latency sensitive.
A large MTU has little to no impact on latency.
> I have a 10 Gig E network with a
> RDMA offload card and am getting average latencies (ping pong) using
> rping of around 14 microsecs in the MPI tests.
It is most likely due to the switch. Try back-to-back to measure without
it. I don't know what hardware you are using, but you can get close to
10us latency over TCP with a standard 10GE NIC and interrupt coalescing
disabled. With a NIC supporting OS-bypass (RDMA only makes sense for
bandwidth), you should see no more than half that, ideally below 3us.