[Figure 2 shows two panels plotting Bandwidth (Mb/s) against MTU (KB, 1.5 to 32) for four configurations: no optimizations, zero-copy, integrated copy/checksum, and zero-copy & checksum offloading.]

Figure 2: TCP Bandwidth on the PowerEdge 4400 platform. The left-hand graph shows bandwidth when the application process does not access the data. The right-hand graph shows the impact of this access on delivered bandwidth.
larger than the 4KB system page size. Moving from a 1500-byte to a 4KB MTU enables page remapping, yielding a sudden jump in bandwidth. This illustrates the maximum benefit from page remapping or other zero-copy networking enhancements on this platform. Bandwidth is lower for MTUs that are not a multiple of the page size because the socket API must copy the “odd” data; Figure 2 omits these points for clarity.

Since checksums are computed in software for the zero-copy line, host CPU saturation again limits bandwidth, this time at a higher peak of 1.5 Gb/s. The right-hand graph shows that the percentage improvement from copy avoidance is lower when netperf accesses its data, yielding a peak of 820 Mb/s. This results from two factors. First, the additional load drives the CPU to saturation at a lower bandwidth, thus the CPU is spending less of its time copying data, diminishing the potential benefit from avoiding the copy. Second, memory caches create a synergy between copying and application processing; each may benefit from data left in cache by the other. Despite these factors, avoiding the copy yields a 26% bandwidth improvement at the 8KB MTU. We omitted the lines for checksum offloading with copy because the results are similar to zero-copy, although the benefit is slightly lower.

The lines labeled zero-copy & checksum offloading show the combined effect of page remapping and use of checksum hardware. In the left-hand graph the host CPUs never touch the data. Bandwidth jumps to 2 Gb/s at the 4KB MTU as page remapping replaces copying. At this point the host CPUs are near saturation from per-packet overheads including buffer management, interrupt handling, and TCP/IP protocol costs. In the right-hand graph netperf touches the data, yielding a much lower peak bandwidth of 1.18 Gb/s for zero-copy & checksum offloading at a 32KB MTU. This is somewhat slower than the left-hand zero-copy points (which checksum the data rather than touching it) because the sender stores to the data; on this Xeon P6-based platform, the CPU can store to memory at only 215 MB/s, but it can read at 504 MB/s. Of course these bandwidths are lower when the CPU is competing with the I/O system for memory bandwidth, as it is in this case.

The lines labeled integrated copy/checksum show the effect of combining the conventional copy and checksum into a single software loop, as implemented in Linux 2.3.99-pre8 kernels for this platform. Microbenchmark results confirm a bandwidth limitation of roughly 100 MB/s for this loop, about half the 4400’s memory copy bandwidth of 207 MB/s. The integrated loop may actually reduce bandwidth with large MTUs on this platform, but we have not investigated why. What is important is that the expected benefits of integrating the copy and checksum can be elusive unless the loop is carefully tuned for specific platforms. In the left-hand graph, the integrated copy/checksum appears to offer a benefit at 8KB MTUs, but this effect disappears if the application touches the data.
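The idea behind the integrated loop can be made concrete with a small sketch: accumulate the 16-bit ones-complement (Internet) checksum in the same pass that copies the data, so the buffer is traversed once instead of twice. The code below is an illustration of the technique only, not the kernel's implementation; the function name, word-at-a-time strategy, and byte-order handling are simplified choices for the example.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy len bytes from src to dst while accumulating the 16-bit
   ones-complement (Internet) checksum in the same pass, so the
   buffer is only traversed once.  Illustration only. */
static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint32_t sum = 0;

    while (len >= 2) {                /* copy and sum one 16-bit word per pass */
        uint16_t w;
        memcpy(&w, s, 2);             /* alignment-safe load */
        memcpy(d, &w, 2);             /* store to the destination buffer */
        sum += w;
        s += 2;
        d += 2;
        len -= 2;
    }
    if (len) {                        /* trailing odd byte (byte order simplified) */
        *d = *s;
        sum += *s;
    }
    while (sum >> 16)                 /* fold carries back into 16 bits */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

int main(void)
{
    const char msg[] = "example payload for the integrated loop";
    char out[sizeof msg];
    uint16_t csum = copy_and_checksum(out, msg, sizeof msg - 1);
    printf("checksum 0x%04x, copy intact: %d\n",
           (unsigned)csum, memcmp(out, msg, sizeof msg - 1) == 0);
    return 0;
}

Production versions unroll and schedule this loop for a specific memory system, which is exactly why, as noted above, its benefit can be elusive when it is not tuned for the platform at hand.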
4.2 Effect of Larger MTUs

Moving beyond 4KB along either x-axis in Figure 2 shows the effect of reducing total per-packet overhead by increasing the MTU. Larger MTUs mean fewer packets, and hence fewer interrupts and less protocol processing overhead. Varying MTU size also reveals the likely benefits of other approaches to reducing per-packet overheads, e.g., interrupt coalescing or TCP offloading.

The data clearly shows a “sweet spot” at the 8KB MTU, which approximates the Jumbo Frames proposal for Ethernet. Standard Ethernet 1500-byte MTUs never yield a bandwidth above 740 Mb/s in these experiments. In addition to reducing per-packet overheads, larger MTUs enable page remapping optimizations for sockets on the receiver, potentially reducing per-byte overheads.

On this platform, bandwidth improvements from increasing MTU size beyond 8KB are modest. This is due to two factors. For the bottom two lines on both graphs, performance is dominated by per-byte overheads, memory system bandwidth, and caching behaviors. In the left-hand zero-copy & checksum offloading line, larger MTUs do not improve bandwidth beyond 2 Gb/s because the bottleneck shifts to the network interface CPU; however, host CPU utilization at this speed drops to 45% with the 32KB MTU. Larger MTUs do yield improvements for the three remaining zero-copy cases, but this effect diminishes with larger packet sizes as data movement overheads increasingly dominate. The effect of larger MTUs is more apparent from a breakdown of how the CPU spends its time.
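The “fewer packets” argument is simple arithmetic: for a fixed delivered bandwidth, the packet rate, and with it the interrupt and per-packet protocol load, falls in proportion to the MTU. The short C program below works through that arithmetic for the MTU sizes on the x-axis of Figure 2; the 1 Gb/s figure is an arbitrary example rather than a measured result, and headers are ignored.

#include <stdio.h>

/* For a fixed delivered bandwidth, packet rate falls in proportion
   to the MTU.  Illustrative arithmetic only: headers are ignored and
   every packet is assumed to carry a full MTU of data. */
int main(void)
{
    const double gbps = 1.0;   /* assumed bandwidth for the example, Gb/s */
    const int mtu_bytes[] = { 1500, 4096, 8192, 16384, 32768 };
    const int n = sizeof mtu_bytes / sizeof mtu_bytes[0];

    for (int i = 0; i < n; i++) {
        double pps = (gbps * 1e9 / 8.0) / mtu_bytes[i];
        printf("MTU %6d bytes -> %8.0f packets/s at %.1f Gb/s\n",
               mtu_bytes[i], pps, gbps);
    }
    return 0;
}

At 370 Mb/s with a 1500-byte MTU the same arithmetic gives roughly 30,000 packets per second, matching the packet rate reported for the profiling experiments below.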
[Figure 3 shows receiver CPU utilization (%) broken down into Idle, Copy & Checksum, Interrupt, VM, TPZ Driver, Buffering, and TCP/IP.]

remapping), network buffer management, the TCP/IP protocol stack, and the Trapeze network driver. The y-axis in Figure 3 gives the CPU utilization attributed to each overhead category. The x-axis gives six different configurations to show the effect of varying the MTU size and enabling page remapping and checksum offloading. For these experiments TCP bandwidth was held constant at 370 Mb/s by a slow sender, so it is possible to directly compare receiver CPU overheads for different configurations.

With a standard Ethernet MTU the Monet is near 95% saturation. The CPU spends about 65% of its time processing 30,000 packets per second through the network driver and TCP/IP stack. Total packet handling costs drop to 22% with an 8KB MTU, and to 10% with a 32KB MTU. Iprobe attributes approximately 20% of CPU time to interrupt dispatching at a 1500-byte MTU, and 5% at an 8KB MTU; in each case it is 20% to 25% of total overhead on the Alpha, showing that interrupt coalescing is an important optimization at higher bandwidths, even with Jumbo Frames. It is interesting to note that actual TCP/IP protocol overhead accounts for at most 12% of CPU time even with a 1500-byte MTU. Protocol costs are significant but are not the limiting fac-
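Since interrupt dispatch accounts for 20% to 25% of the receiver overhead measured here, the coalescing idea is worth sketching. The code below shows only the general batching principle: service every completed slot in the receive ring per interrupt (or poll), so the fixed dispatch cost is amortized over a burst of packets. It is not the Trapeze driver or any particular NIC's scheme; the ring structure and names are invented for the example.

#include <stdio.h>

#define RING_SIZE 64

/* Toy receive ring; in a real driver the NIC deposits packets into
   these slots by DMA and then raises an interrupt. */
struct rx_slot { int full; int len; };

static struct rx_slot ring[RING_SIZE];
static int head;                      /* next slot the host will service */

static void deliver_packet(int len)
{
    (void)len;                        /* hand the packet to the protocol stack */
}

/* Interrupt (or poll) handler: drain every completed slot in one pass,
   so the fixed cost of interrupt dispatch is paid once per batch of
   packets rather than once per packet. */
static int service_rx_ring(void)
{
    int handled = 0;
    while (ring[head].full) {
        deliver_packet(ring[head].len);
        ring[head].full = 0;
        head = (head + 1) % RING_SIZE;
        handled++;
    }
    return handled;
}

int main(void)
{
    for (int i = 0; i < 10; i++) {    /* simulate a burst of ten arrivals */
        ring[i].full = 1;
        ring[i].len  = 1500;
    }
    printf("one interrupt serviced %d packets\n", service_rx_ring());
    return 0;
}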