Fast Userspace OVS with
AF_XDP
OVS Conference 2018
William Tu, VMware Inc
Outline
• AF_XDP Introduction
• OVS AF_XDP netdev
• Performance Optimizations
Linux AF_XDP
• A new socket type that receives/sends raw frames at high speed
• Uses an XDP (eXpress Data Path) program to trigger receive
• The userspace program manages the Rx/Tx rings and the Fill/Completion rings (see the setup sketch below)
• Zero copy from the DMA buffer to userspace memory, with driver support
• Ingress/egress performance > 20 Mpps [1]
(Figure from “DPDK PMD for AF_XDP”, Zhang Qi)
[1] The Path to DPDK Speeds for AF XDP, Linux Plumber 2018
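To make the ring setup concrete, here is a minimal sketch of creating a umem and binding an AF_XDP socket with the libbpf xsk helper API (xsk.h). NUM_FRAMES, the frame size, and the interface/queue arguments are illustrative assumptions, not code from the OVS patches.

#include <bpf/xsk.h>     /* libbpf AF_XDP (xsk) helpers */
#include <stdlib.h>
#include <unistd.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE  /* one packet buffer per chunk */

static struct xsk_ring_prod fill, tx;  /* userspace -> kernel rings */
static struct xsk_ring_cons comp, rx;  /* kernel -> userspace rings */
static struct xsk_umem *umem;
static struct xsk_socket *xsk;
static void *bufs;

static int setup_xsk(const char *ifname, unsigned int queue_id)
{
    int err;

    /* Packet buffer area shared with the kernel (the "umem"). */
    err = posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE);
    if (err) {
        return err;
    }

    /* Register the umem; this also creates the fill and completion rings. */
    err = xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
                           &fill, &comp, NULL);
    if (err) {
        return err;
    }

    /* Bind a socket to one device queue; this creates the rx/tx rings and,
     * by default, attaches an XDP program that redirects to the socket. */
    return xsk_socket__create(&xsk, ifname, queue_id, umem, &rx, &tx, NULL);
}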
OVS-AF_XDP Netdev
Goal
• Use the AF_XDP socket as a fast channel to the userspace OVS datapath, dpif-netdev
• Flow processing happens in userspace
[Diagram: hardware -> driver + XDP in the kernel -> AF_XDP socket (the high-speed channel) -> userspace datapath inside ovs-vswitchd, bypassing the kernel network stack]
OVS-AF_XDP Architecture
Existing
• netdev: abstraction layer for network devices
• dpif: datapath interface
• dpif-netdev: userspace implementation of the OVS
datapath
New
• Kernel: XDP program and eBPF map (see the sketch below)
• AF_XDP netdev: implementation of the afxdp device type
ovs/Documentation/topics/porting.rst
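The "Kernel: XDP program and eBPF map" piece is, in essence, an XDP program that redirects received frames into AF_XDP sockets through a BPF_MAP_TYPE_XSKMAP keyed by RX queue. Below is a minimal sketch in the style of the kernel's xdpsock sample; the map name and size are assumptions, not necessarily what the OVS patches load.

/* Sketch of the kernel-side XDP program: steer frames to AF_XDP sockets. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") xsks_map = {
    .type        = BPF_MAP_TYPE_XSKMAP,
    .key_size    = sizeof(int),
    .value_size  = sizeof(int),
    .max_entries = 64,              /* one slot per device queue */
};

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    int index = ctx->rx_queue_index;

    /* If an AF_XDP socket is bound to this queue, hand the frame to it;
     * otherwise let the frame continue up the regular kernel stack. */
    if (bpf_map_lookup_elem(&xsks_map, &index)) {
        return bpf_redirect_map(&xsks_map, index, 0);
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";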
OVS AF_XDP Configuration
# ./configure
# make && make install
# make check-afxdp
# ovs-vsctl add-br br0 -- \
    set Bridge br0 datapath_type=netdev
# ovs-vsctl add-port br0 enp2s0 -- \
    set interface enp2s0 type="afxdp"
Based on v3 patch: [ovs-dev] [PATCHv3 RFC 0/3] AF_XDP netdev support for OVS
Prototype Evaluation
• The sender sends 64-byte packets at 20 Mpps to one port; measure the
receiving packet rate at the other port
• Measure single-flow, single-core performance with Linux
kernel 4.19-rc3 and OVS master
• AF_XDP zero-copy mode enabled
• Performance goal: 20 Mpps rxdrop
[Test setup: a DPDK packet generator (Intel XL710 40GbE) sends 20 Mpps to the device under test (16-core Intel Xeon E5-2620 v3 2.4 GHz, 32 GB memory, Netronome NFP-4000 with AF_XDP); traffic enters enp2s0, crosses the userspace datapath via br0, and egresses]
Budget your packet like you budget your money
Time Budget
To achieve 20Mpps
• Budget per packet: 50ns
• 2.4GHz CPU: 120 cycles per packet
Facts [1]
• Cache miss: 32 ns; x86 LOCK prefix: 8.25 ns
• System call with/without SELinux auditing: 75 ns / 42 ns
Batch of 32 packets
• Budget per batch: 50 ns × 32 = 1.6 µs (see the arithmetic below)
[1] Improving Linux networking performance, LWN, https://lwn.net/Articles/629155/, Jesper Brouer
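Spelled out, the per-packet and per-batch numbers follow directly from the target rate and the CPU clock:

\[
t_{\text{pkt}} = \frac{1}{20 \times 10^{6}\,\text{pps}} = 50\,\text{ns},
\qquad
50\,\text{ns} \times 2.4\,\text{GHz} = 120\ \text{cycles/packet},
\qquad
32 \times 50\,\text{ns} = 1.6\,\mu\text{s/batch}.
\]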
Optimization 1/5
• OVS pmd (poll-mode driver) netdev for rx/tx
• Before: call the poll() syscall and wait for new I/O
• After: a dedicated thread busy-polls the Rx ring (see the sketch after the diff below)
• Effect: avoids system call overhead
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
     .construct = netdev_linux_construct,
     .get_stats = netdev_internal_get_stats,
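As a rough illustration of the "after" case, a PMD thread can spin on the AF_XDP rx ring with the libbpf xsk helpers instead of sleeping in poll(); this rxdrop-style loop, with its batch size of 32, is a sketch under those assumptions rather than the exact OVS code.

/* Busy-poll receive sketch (rxdrop style): drain up to 32 descriptors per
 * pass and immediately recycle the frames through the fill ring. */
static void pmd_rx_loop(struct xsk_ring_cons *rxq, struct xsk_ring_prod *fq)
{
    for (;;) {                          /* dedicated PMD thread, no poll() */
        __u32 idx_rx = 0, idx_fq = 0;
        size_t rcvd = xsk_ring_cons__peek(rxq, 32, &idx_rx);

        if (!rcvd) {
            continue;                   /* nothing yet, keep spinning */
        }
        while (xsk_ring_prod__reserve(fq, rcvd, &idx_fq) != rcvd) {
            ;                           /* wait for room in the fill ring */
        }
        for (size_t i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rxq, idx_rx + i);

            /* A real datapath would wrap desc->addr/desc->len in a dp_packet
             * here; for rxdrop the frame is simply handed straight back. */
            *xsk_ring_prod__fill_addr(fq, idx_fq + i) = desc->addr;
        }
        xsk_ring_prod__submit(fq, rcvd);
        xsk_ring_cons__release(rxq, rcvd);
    }
}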
Optimization 2/5
• Packet metadata pre-allocation
• Before: allocate metadata when a packet is received
• After: pre-allocate and pre-initialize metadata (sketched below)
• Effect:
• Reduces the number of per-packet operations
• Reduces cache misses
[Diagram: packet data is stored in multiple 2 KB umem chunks; packet metadata (struct dp_packet) lives in a contiguous memory region that maps one-to-one to the AF_XDP umem chunks]
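A sketch of the idea with simplified, stand-in types (the real struct dp_packet is richer): metadata is one contiguous, pre-initialized array indexed by umem chunk, so receiving a frame is a lookup rather than an allocation.

#include <stdint.h>

#define CHUNK_SIZE 2048                 /* one umem chunk per packet */
#define NUM_CHUNKS 4096

struct pkt_md {                         /* stand-in for struct dp_packet */
    void *data;                         /* points into the umem chunk */
    uint32_t len;
    /* ... flow metadata, offsets, ... */
};

static struct pkt_md md[NUM_CHUNKS];    /* contiguous, pre-allocated */

static void md_preinit(void *umem_area)
{
    /* One-to-one mapping: umem chunk i <-> metadata slot i, set up once. */
    for (int i = 0; i < NUM_CHUNKS; i++) {
        md[i].data = (char *) umem_area + (uint64_t) i * CHUNK_SIZE;
        md[i].len = 0;
    }
}

static inline struct pkt_md *md_from_addr(uint64_t umem_addr)
{
    /* At receive time the descriptor address maps straight to its
     * pre-initialized metadata -- no per-packet allocation. */
    return &md[umem_addr / CHUNK_SIZE];
}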
Optimizations 3-5
• Packet data memory pool for AF_XDP (see the sketch after this slide)
• Fast data structure to GET and PUT free memory chunks
• Effect: reduces cache misses
• Dedicated packet data pool per device queue
• Effect: consumes more memory but avoids a mutex lock
• Batched sendmsg system calls
• Effect: reduces the system call rate
Reference: Bringing the Power of eBPF to Open vSwitch, Linux Plumber 2018
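The perf profiles later in the deck show umem_elem_push/umem_elem_pop; a plausible minimal shape for the per-queue free-chunk pool is a LIFO stack of umem addresses, sketched below under the assumption of a single PMD thread per queue (so no locking is needed). The real helpers in the patches may differ.

#include <stdint.h>

/* Sketch of a per-queue free-chunk pool: GET/PUT are a plain LIFO of umem
 * addresses, so recently freed (cache-warm) chunks are reused first. */
struct umem_pool {
    uint32_t n_free;              /* number of free chunks on the stack */
    uint64_t addrs[4096];         /* free umem chunk addresses */
};

static inline void umem_elem_push(struct umem_pool *p, uint64_t addr)
{
    p->addrs[p->n_free++] = addr;          /* PUT a chunk back */
}

static inline int umem_elem_pop(struct umem_pool *p, uint64_t *addr)
{
    if (p->n_free == 0) {
        return -1;                         /* pool exhausted */
    }
    *addr = p->addrs[--p->n_free];         /* GET the most recently freed chunk */
    return 0;
}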
Performance Evaluation
OVS AF_XDP RX drop
# ovs-ofctl add-flow br0 \
    "in_port=enp2s0, actions=drop"
# ovs-appctl pmd-stats-show
[Diagram: packets enter enp2s0 on br0 and are dropped by the OVS AF_XDP datapath]
pmd-stats-show (rxdrop)
pmd thread numa_id 0 core_id 11:
packets received: 2069687732
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 2069687636
smc hits: 0
megaflow hits: 95
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 1
miss with failed upcall: 0
avg. packets per output batch: 0.00
idle cycles: 4196235931 (1.60%)
processing cycles: 258609877383 (98.40%)
avg cycles per packet: 126.98 (262806113314/2069687732)
avg processing cycles per packet: 124.95 (258609877383/2069687732)
(120-cycle per-packet budget for 20 Mpps)
perf record -p `pidof ovs-vswitchd` sleep 10
26.91% pmd7 ovs-vswitchd [.] netdev_linux_rxq_xsk
26.38% pmd7 ovs-vswitchd [.] dp_netdev_input__
24.65% pmd7 ovs-vswitchd [.] miniflow_extract
6.87% pmd7 libc-2.23.so [.]__memcmp_sse4_1
3.27% pmd7 ovs-vswitchd [.] umem_elem_push
3.06% pmd7 ovs-vswitchd [.] odp_execute_actions
2.03% pmd7 ovs-vswitchd [.] umem_elem_pop
top
  PID USER   PR NI   VIRT   RES  SHR S  %CPU %MEM    TIME+ COMMAND
   16 root   20  0      0     0    0 R 100.0  0.0 75:16.85 ksoftirqd/1
21088 root   20  0 451400 52656 4968 S 100.0  0.2  6:58.70 ovs-vswitchd
Mempool overhead: umem_elem_push / umem_elem_pop (~5% of cycles)
OVS AF_XDP l2fwd
# ovs-ofctl add-flow br0 "in_port=enp2s0, actions=
    set_field:14->in_port,
    set_field:a0:36:9f:33:b1:40->dl_src, enp2s0"
[Diagram: packets enter enp2s0, are rewritten, and are sent back out enp2s0 through br0]
pmd-stats-show (l2fwd)
pmd thread numa_id 0 core_id 11:
packets received: 868900288
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 868900164
smc hits: 0
megaflow hits: 122
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 2
miss with failed upcall: 0
avg. packets per output batch: 30.57
idle cycles: 3344425951 (2.09%)
processing cycles: 157004675952 (97.91%)
avg cycles per packet: 184.54 (160349101903/868900288)
avg processing cycles per packet: 180.69 (157004675952/868900288)
Extra ~55 cycles per packet for send (vs. rxdrop)
perf record -p `pidof ovs-vswitchd` sleep 10
25.92% pmd7 ovs-vswitchd [.]netdev_linux_rxq_xsk
17.75% pmd7 ovs-vswitchd [.]dp_netdev_input__
16.55% pmd7 ovs-vswitchd [.]netdev_linux_send
16.10% pmd7 ovs-vswitchd [.]miniflow_extract
4.78% pmd7 libc-2.23.so [.]__memcmp_sse4_1
3.67% pmd7 ovs-vswitchd [.]dp_execute_cb
2.86% pmd7 ovs-vswitchd [.]__umem_elem_push
2.46% pmd7 ovs-vswitchd [.]__umem_elem_pop
1.96% pmd7 ovs-vswitchd [.]non_atomic_ullong_add
1.69% pmd7 ovs-vswitchd [.]dp_netdev_pmd_flush_output_on_port
top output is similar to rxdrop
Mempool overhead: __umem_elem_push / __umem_elem_pop (~5% of cycles)
# ./configure --with-dpdk=
# ovs-ofctl add-flow br0 \
    "in_port=enp2s0, actions=output:vhost-user-1"
# ovs-ofctl add-flow br0 \
    "in_port=vhost-user-1, actions=output:enp2s0"
AF_XDP PVP Performance
• QEMU 3.0.0
• VM Ubuntu 18.04
• DPDK stable 17.11.4
• OVS-DPDK vhostuserclient port
• options:dq-zero-copy=true
• options:n_txq_desc=128
[Diagram: PVP path — packets arrive on enp2s0 via XDP redirect into the OVS AF_XDP bridge br0, reach the VM over QEMU vhost-user/virtio, and return the same way]
PVP CPU utilization
  PID USER   PR NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
   16 root   20  0       0     0     0 R 100.0  0.0 88:26.26 ksoftirqd/1
21510 root   20  0 9807168 53724  5668 S 100.0  0.2  5:58.38 ovs-vswitchd
21662 root   20  0 4894752 30576 12252 S 100.0  0.1  5:21.78 qemu-system-x86
21878 root   20  0   41940  3832  3096 R   6.2  0.0  0:00.01 top
pmd-stats-show (PVP)
pmd thread numa_id 0 core_id 11:
packets received: 205680121
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 205680121
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 31.01
idle cycles: 0 (0.00%)
processing cycles: 74238999024 (100.00%)
avg cycles per packet: 360.94 (74238999024/205680121)
avg processing cycles per packet: 360.94 (74238999024/205680121)
AF_XDP PVP Performance Evaluation
• ./perf record -p `pidof ovs-vswitchd` sleep 10
15.88% pmd28 ovs-vswitchd [.]rte_vhost_dequeue_burst
14.51% pmd28 ovs-vswitchd [.]rte_vhost_enqueue_burst
10.41% pmd28 ovs-vswitchd [.]dp_netdev_input__
8.31% pmd28 ovs-vswitchd [.]miniflow_extract
7.65% pmd28 ovs-vswitchd [.]netdev_linux_rxq_xsk
5.59% pmd28 ovs-vswitchd [.]netdev_linux_send
4.20% pmd28 ovs-vswitchd [.]dpdk_do_tx_copy
3.96% pmd28 libc-2.23.so [.]__memcmp_sse4_1
3.94% pmd28 libc-2.23.so [.]__memcpy_avx_unaligned
2.45% pmd28 ovs-vswitchd [.]free_dpdk_buf
2.43% pmd28 ovs-vswitchd [.]__netdev_dpdk_vhost_send
2.14% pmd28 ovs-vswitchd [.]miniflow_hash_5tuple
1.89% pmd28 ovs-vswitchd [.]dp_execute_cb
1.82% pmd28 ovs-vswitchd [.]netdev_dpdk_vhost_rxq_recv
Performance Result

OVS AF_XDP        PPS        CPU
  RX Drop         19 Mpps    200%
  L2fwd [2]       14 Mpps    200%
  PVP [3]         3.3 Mpps   300%

OVS DPDK [1]      PPS        CPU
  RX Drop         NA         NA
  l3fwd           13 Mpps    100%
  PVP             7.4 Mpps   200%

[1] Intel® Open Network Platform Release 2.1 Performance Test Report
[2] Demo rxdrop/l2fwd: https://www.youtube.com/watch?v=VGMmCZ6vA0s
[3] Demo PVP: https://www.youtube.com/watch?v=WevLbHf32UY
Conclusion 1/2
• AF_XDP is a high-speed Linux socket type
• We add a new netdev type based on AF_XDP
• Re-use the userspace datapath used by OVS-DPDK
Performance
• Pre-allocate and pre-init as much as possible
• Batching does not reduce # of per-packet operations
• Batching + cache-aware data structure amortizes the cache misses
Conclusion 2/2
• Need high packet rate but can’t deploy DPDK? Use AF_XDP!
• Still slower than OVS-DPDK [1], more optimizations are coming [2]
Comparison with OVS-DPDK
• Better integration with the Linux kernel and management tools
• Selectively use kernel features; no packet re-injection needed
• Does not require a dedicated device or CPU
[1] The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel
[2] The Path to DPDK Speeds for AF XDP, Linux Plumber 2018
Thank you
./perf kvm stat record -p 21662 sleep 10
Analyze events for all VMs, all VCPUs:
VM-EXIT  Samples  Samples%  Time%  Min Time  Max Time  Avg time
HLT 298071 95.56% 99.91% 0.43us 511955.09us 32.95us (+- 19.18% )
EPT_MISCONFIG 10366 3.32% 0.05% 0.39us 12.35us 0.47us ( +- 0.71% )
EXTERNAL_INTERRUPT 2462 0.79% 0.01% 0.33us 21.20us 0.50us ( +- 3.21% )
MSR_WRITE 761 0.24% 0.01% 0.40us 12.74us 1.19us ( +- 3.51% )
IO_INSTRUCTION 185 0.06% 0.02% 1.98us 35.96us 8.30us ( +- 4.97% )
PREEMPTION_TIMER 62 0.02% 0.00% 0.52us 2.77us 1.04us ( +- 4.34% )
MSR_READ 19 0.01% 0.00% 0.79us 2.49us 1.37us ( +- 8.71% )
EXCEPTION_NMI 1 0.00% 0.00% 0.58us 0.58us 0.58us ( +- 0.00% )
Total Samples: 311927, Total events handled time: 9831483.62us.
root@ovs-afxdp:~/ovs# ovs-vsctl show
2ade349f-2bce-4118-b633-dce5ac51d994
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "vhost-user-1"
            Interface "vhost-user-1"
                type: dpdkvhostuser
        Port "enp2s0"
            Interface "enp2s0"
                type: afxdp
QEMU
qemu-system-x86_64 -hda ubuntu1810.qcow \
    -m 4096 \
    -cpu host,+x2apic -enable-kvm \
    -chardev socket,id=char1,path=/tmp/vhost,server \
    -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
    -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem -mem-prealloc -smp 2
Editor's Notes
  1. The previous approach introduced a BPF_ACTION. TC is a kernel packet queuing subsystem that provides QoS. ovs-vswitchd creates the maps, loads the eBPF programs, etc.
  2. Compare Linux kernel 4.9-rc3
  3. ovs-ofctl add-flow br0 "in_port=enp2s0\ actions=set_field:14->in_port,set_field:a0:36:9f:33:b1:40->dl_src,enp2s0"
  4. 10455 2018-12-04T17:34:15.952Z|00146|dpdk|INFO|VHOST_CONFIG: dequeue zero copy is enabled
  5. 16 root 20 0 0 0 0 R 100.0 0.0 10:17.12 ksoftirqd/1 19525 root 20 0 9807164 54104 5800 S 106.7 0.2 2:07.59 ovs-vswitchd 19627 root 20 0 4886528 30336 12260 S 106.7 0.1 0:59.59 qemu-system-x86