Boost UDP Transaction Performance
Toshiaki Makita
NTT Open Source Software Center
• Background
• How many packets can arrive on the wire?
• Up to 14,880,952 packets/s at 10GbE wire rate for the shortest frames*1
*1 shortest ethernet frame 64 bytes + preamble+IFG 20 bytes = 84 bytes = 672 bits; 10,000,000,000 / 672 = 14,880,952
How many transactions to handle?
• Up to 7,530,120 transactions/s at 10GbE wire rate for 100-byte UDP payloads*1
*1 100 bytes + IP/UDP/Ether headers 46 bytes + preamble+IFG 20 bytes = 166 bytes = 1328 bits; 10,000,000,000 / 1328 = 7,530,120
Basic technologies for network performance (not only for UDP)
• TSO/GSO (Tx: segmentation) and GRO (Rx: aggregation)
• TCP is a byte stream: the Tx server can segment and the Rx server can aggregate packets around the MTU size → great performance gain!
• UDP has an explicit boundary between datagrams :<
• Cannot segment/aggregate packets → not applicable to UDP*1 *2
*1 TCP in UDP tunneling (e.g. VXLAN) is OK as well
*2 Other than UFO, which is rarely implemented on physical NICs
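• Which of these offloads are available and enabled on an interface can be checked with ethtool (a quick sketch; ens1f0 is the interface name used in the later examples):
$ ethtool -k ens1f0 | grep -E 'segmentation-offload|generic-receive-offload|udp-fragmentation-offload'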
Basic technologies for network performance
• RSS
• Scales network Rx processing across cores in a multi-core server
• RSS itself is a NIC feature
• Distributes packets to multiple rx queues in the NIC
• Each queue has a different interrupt vector
(packets on each queue can be processed by a different core)
• Applicable to TCP/UDP :)
• Common 10G NICs have RSS
[Figure: RSS in the NIC distributes incoming packets to multiple rx queues; each queue interrupts a different core, and the server echoes the packets back]
• Benchmark: one client sends bulk 100-byte UDP packets and the server echoes them back*1 *2
*1 The server creates as many threads as cores; each thread just calls recvfrom() and sendto()
*2 There is only 1 client (IP address). To spread UDP traffic on the NIC, RSS is configured to see UDP port numbers. This setting is not needed for common UDP servers.
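• Making the RSS hash look at UDP port numbers (footnote *2) can be done with ethtool; a sketch, assuming the device is ens1f0 and the NIC supports configurable flow hashing:
# ethtool -N ens1f0 rx-flow-hash udp4 sdfn
• "sdfn" selects source/destination IP plus source/destination UDP port as hash inputs.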
How to improve this?
• smp_affinity *1
$ for ((irq=105; irq<=124; irq++)); do
> cat /proc/irq/$irq/smp_affinity
> done
01000 -> 12 -> Node 0 04000 -> 14 -> Node 0
00800 -> 11 -> Node 0 00001 -> 0 -> Node 0
00400 -> 10 -> Node 0 02000 -> 13 -> Node 0
00400 -> 10 -> Node 0 01000 -> 12 -> Node 0
01000 -> 12 -> Node 0 00008 -> 3 -> Node 0
04000 -> 14 -> Node 0 00800 -> 11 -> Node 0
00400 -> 10 -> Node 0 00800 -> 11 -> Node 0
00010 -> 4 -> Node 0 04000 -> 14 -> Node 0
00004 -> 2 -> Node 0 00800 -> 11 -> Node 0
02000 -> 13 -> Node 0 02000 -> 13 -> Node 0
• All 20 interrupts are assigned to CPUs on Node 0, and several CPUs (10-14) serve multiple interrupts, so Rx processing is not spread evenly across the cores
*1 irq numbers can be obtained from /proc/interrupts
Check affinity_hint
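• The hint the NIC driver provides for each rx interrupt can be compared with the current affinity; a sketch, reusing the irq range 105-124 from the listing above:
$ for ((irq=105; irq<=124; irq++)); do
> echo "irq $irq: hint=$(cat /proc/irq/$irq/affinity_hint) current=$(cat /proc/irq/$irq/smp_affinity)"
> done
• irqbalance can be told to follow these hints via its hintpolicy option (irqbalance --hintpolicy=exact); whether that is exactly how the hints were applied here is an assumption.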
• Check rx queue usage with ethtool -S
$ ethtool -S ens1f0 | grep 'rx_queue_.*_packets'
rx_queue_0_packets: 198005155
rx_queue_1_packets: 153339750
rx_queue_2_packets: 162870095
rx_queue_3_packets: 172303801
rx_queue_4_packets: 153728776
rx_queue_5_packets: 158138563
rx_queue_6_packets: 164411653
rx_queue_7_packets: 165924489
rx_queue_8_packets: 176545406
rx_queue_9_packets: 165340188
rx_queue_10_packets: 150279834
rx_queue_11_packets: 150983782
rx_queue_12_packets: 157623687
rx_queue_13_packets: 150743910
rx_queue_14_packets: 158634344
rx_queue_15_packets: 158497890
rx_queue_16_packets: 4
rx_queue_17_packets: 3
rx_queue_18_packets: 0
rx_queue_19_packets: 8
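• rx queues 16-19 receive almost no packets, so four cores get no Rx work from the NIC; RPS can steer some Rx processing to those idle cores in software. A minimal sketch (the queue-to-CPU mapping is an assumption, not necessarily the configuration behind the numbers below):
# for rxq in 0 1 2 3; do
> cpu=$((rxq + 16))
> printf '%x' $((1 << cpu)) > /sys/class/net/ens1f0/queues/rx-$rxq/rps_cpus
> done
• rps_cpus takes a hexadecimal CPU bitmask; here packets arriving on rx queues 0-3 are processed on CPUs 16-19.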
• Performance change
• Before (RSS only): 270,000 tps (approx. 360Mbps)
• After (+affinity_hint+RPS): 17,000 tps (approx. 23Mbps)
• Got worse...
• perf
• Profiling tool developed in the kernel tree
• Identifies hot spots by sampling CPU cycles
• FlameGraph
• Visualizes perf.data in svg format
• https://github.com/brendangregg/FlameGraph
• FlameGraph of CPU0
• x-axis (width): CPU consumption
• y-axis (height): depth of call stack
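• A typical way to record and render such a graph; a sketch in which the sampling options are assumptions rather than the exact commands used for these slides:
# perf record -a -g -- sleep 10
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg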
[FlameGraph: most CPU time is spent in queued_spin_lock_slowpath, called from udp_queue_rcv_skb]
[Figure: with RPS, packets from the NIC's rx queues are spread over per-core backlogs and softirq processing, but every core then delivers into the single UDP socket → lock contention!!]
Use SO_REUSEPORT
• With a single UDP socket, every core contends for the same socket lock; SO_REUSEPORT lets the server bind one socket per thread to the same port, so each core mostly works on its own socket
[CPU usage chart, after: interrupt processing (irq) vs. userspace thread time (sys, user)]
• Performance change
• RSS: 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• Great improvement!
• but...
• More analysis
• The FlameGraph still shows queued_spin_lock_slowpath
• By default SO_REUSEPORT selects the destination socket by flow hash, so a packet can land on a socket whose owning thread runs on a different core, and some contention remains
Use SO_ATTACH_REUSEPORT_EBPF
• Attach a BPF program to the SO_REUSEPORT socket group so that the socket is selected by CPU number instead of by flow hash*1
[CPU usage charts, before/after: interrupt processing (irq) vs. userspace thread time (sys, user)]
• Performance change
• RSS: 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
*1 BPF allows much more flexible logic, but this time only the cpu number is used
• Pin each userspace thread to the core that processes its socket's packets, for better cache affinity
• cgroup, taskset, pthread_setaffinity_np(), ... any way you like (e.g. the taskset sketch below)
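• A sketch with taskset; "udpserver" is a placeholder for the server binary, and one worker thread per CPU is assumed:
# cpu=0
# for tid in $(ls /proc/$(pidof udpserver)/task); do
> taskset -p -c $cpu $tid
> cpu=$((cpu + 1))
> done
• Calling pthread_setaffinity_np() inside the server achieves the same thing without external tools.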
[Figure: thread0 and UDP socket0 pinned to Core0: the NIC/RPS steers the flow to Core0's backlog, the softirq on Core0 delivers into socket0, and thread0 reads it on the same core]
• Performance change
• RSS: 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
XPS
[Figure: Tx path: userspace threads on each core send through a Qdisc and NIC tx queue; lock contention!! arises when threads on different cores share the same Qdisc/tx queue]
• XPS (Transmit Packet Steering) selects the tx queue based on the sending CPU, so each core can use its own tx queue
• Try disabling it
# for ((txq=0; txq<20; txq++)); do
> echo 0 > /sys/class/net/ens1f0/queues/tx-$txq/xps_cpus
> done
[FlameGraph: with XPS disabled, queued_spin_lock_slowpath (lock contention) appears in the Tx part of the userspace threads' stacks, in addition to interrupt processing (irq) and Rx]
Enable XPS
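• To re-enable it, map each tx queue back to a CPU; a sketch that mirrors the disable loop above, assuming a 1:1 queue-to-CPU mapping:
# for ((txq=0; txq<20; txq++)); do
> printf '%x' $((1 << txq)) > /sys/class/net/ens1f0/queues/tx-$txq/xps_cpus
> done
• XPS appears to have been enabled from the start of this benchmark, which is why the baseline in the lists below is labelled "RSS (+XPS)" with unchanged numbers.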
Disable GRO
• The FlameGraph shows dev_gro_receive consuming cycles; GRO cannot aggregate plain UDP datagrams, so for this workload it only adds per-packet overhead
• WARNING:
• Don't disable it if TCP performance matters
• Disabling GRO makes TCP rx throughput miserably low
• Don't disable it on KVM hypervisors either
• GRO boosts throughput of tunneling-protocol traffic as well as guests' TCP traffic on hypervisors
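• Disabling GRO itself is a single ethtool switch (interface name as in the earlier examples):
# ethtool -K ens1f0 gro off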
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
Unload iptables
• The FlameGraph shows nf_iterate (netfilter hook traversal); even with no rules configured, merely having the iptables modules loaded makes every packet walk the hooks
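• If the host does not need packet filtering, the modules can be removed so the hooks are never traversed; a sketch, since the set of loaded iptable_* modules varies by system:
# modprobe -r iptable_filter ip_tables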
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
Disable source IP validation
• The FlameGraph shows fib_table_lookup; with reverse path filtering (rp_filter) enabled, an extra route lookup is performed to validate the source address of every received packet
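• Reverse path filtering can be turned off with sysctl if the environment does not rely on it; the effective value is the maximum of the "all" and per-interface settings:
# sysctl -w net.ipv4.conf.all.rp_filter=0
# sysctl -w net.ipv4.conf.ens1f0.rp_filter=0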
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
Disable audit
• recvfrom() and sendto() are system calls; with auditing enabled, audit code runs on every syscall, which is costly at millions of transactions per second (see the sketch after the list)
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
• +Disable audit: 5,860,000 tps (approx. 7780Mbps)
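• A sketch for turning auditing off when it is not required; how auditd is stopped differs between distributions:
# auditctl -e 0
# systemctl stop auditd
• On some distributions auditd refuses systemctl stop and must be stopped with "service auditd stop" instead.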
Skip IP ID calculation
• The FlameGraph shows __ip_select_ident, which generates an IP ID for every outgoing packet using shared state that all cores touch
• The IP ID is only needed when a packet may be fragmented; one way to skip the calculation (not necessarily the exact change used here) is to set the DF bit on outgoing packets, e.g. by enabling path MTU discovery (IP_MTU_DISCOVER = IP_PMTUDISC_DO) on the sending socket
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
• +Disable audit: 5,860,000 tps (approx. 7780Mbps)
• +Skip ID calculation: 6,010,000 tps (approx. 7980Mbps)
Enable hyper threading
• Hyper threading doubles the number of logical cores, giving interrupt processing and the userspace threads more CPUs to run on
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
• +Disable audit: 5,860,000 tps (approx. 7780Mbps)
• +Skip ID calculation: 6,010,000 tps (approx. 7980Mbps)
• +Hyper threading: 7,010,000 tps (approx. 9310Mbps)
[FlameGraph: the remaining hot spots are _raw_spin_lock and memory allocation/free (kmem, kmalloc, kfree)]
• Virtualization
• UDP servers as guests
• Hypervisor can saturate CPUs or drop packets
• We are going to investigate ways to boost performance in virtualized environments as well
• OS settings
• Use RPS if rx-queues are not enough
• Make sure XPS is configured
• Consider other tunings to reduce per-core overhead
• Disable GRO
• Unload iptables
• Disable source IP validation
• Disable auditd
• Hardware
• Use NICs that have enough RSS rx queues if possible
(as many queues as cores)