Understanding DPDK
Description of techniques used to achieve high throughput on commodity hardware
How fast does the SW have to work?
14.88 million 64-byte packets per second on a 10G interface
1.8 GHz -> 1 cycle = 0.55 ns
1 packet -> 67.2 ns = ~120 clock cycles
[Figure: per-packet wire overhead for a 64-byte frame - IFG (12 B) + Preamble (8 B) + DST MAC / SRC MAC / Type / Payload (60 B) + CRC (4 B) = 84 bytes on the wire]
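These figures can be reproduced with a few lines of arithmetic; a small illustrative program (not from the original slides):

```c
/* Worked example: derive the 14.88 Mpps rate and the per-packet cycle
 * budget quoted above. Pure arithmetic, no DPDK required. */
#include <stdio.h>

int main(void)
{
    const double link_bps   = 10e9;       /* 10 Gbit/s */
    const double wire_bytes = 84.0;       /* 64 B frame + 20 B IFG/preamble */
    const double cpu_hz     = 1.8e9;      /* 1.8 GHz core */

    double pps    = link_bps / (wire_bytes * 8.0);   /* ~14.88 Mpps */
    double ns_pkt = 1e9 / pps;                       /* ~67.2 ns per packet */
    double cycles = ns_pkt / (1e9 / cpu_hz);         /* ~121 cycles per packet */

    printf("%.2f Mpps, %.1f ns/packet, %.0f cycles/packet\n",
           pps / 1e6, ns_pkt, cycles);
    return 0;
}
```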
Comparative speed values
CPU to memory bandwidth = 6-8 GB/s
PCI-Express x16 bandwidth = 5 GB/s
Access to RAM = 200 ns
Access to L3 cache = 4 ns
Context switch ~= 1000 ns (measured on a 3.2 GHz CPU)
Packet processing in Linux
[Diagram: NIC RX/TX queues -> kernel-space driver -> ring buffers -> socket -> user-space App]
Linux kernel overhead
System calls
Context switching on blocking I/O
Data copying from kernel to user space
Interrupt handling in kernel
Expense of sendto
Function        Activity                              Time (ns)
sendto          system call                           96
sosend_dgram    lock sock_buff, alloc mbuf, copy in   137
udp_output      UDP header setup                      57
ip_output       route lookup, IP header setup         198
ether_output    MAC lookup, MAC header setup          162
ixgbe_xmit      device programming                    220
Total                                                 950
Packet processing with DPDK
[Diagram: the NIC RX/TX queues are mapped through a UIO driver straight into user space; the DPDK app polls the ring buffers directly and the kernel network stack is bypassed]
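As a rough illustration of what the user-space side looks like, here is a minimal forwarding loop sketched with DPDK's burst API (rte_eth_rx_burst / rte_eth_tx_burst). Port and queue setup (rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_tx_queue_setup, rte_eth_dev_start) is omitted, and ports 0/1 and BURST_SIZE are assumptions, so treat this as a sketch rather than a complete application:

```c
/* Sketch of a DPDK run-to-completion forwarding loop.
 * Assumes ports 0 and 1 are already configured and started. */
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char *argv[])
{
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    /* ... port configuration and queue setup omitted for brevity ... */

    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];

        /* Poll port 0, queue 0: no interrupts, packets arrive in batches. */
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Forward the whole batch out of port 1. */
        uint16_t nb_tx = rte_eth_tx_burst(1, 0, bufs, nb_rx);

        /* Free any mbufs the TX queue could not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
    return 0;
}
```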
Updating a register in Linux
[Diagram: user space issues ioctl() -> syscall -> VFS -> driver does copy_from_user() -> iowrite() -> HW register]
Updating a register with DPDK
[Diagram: user space writes the memory-mapped HW register directly with a plain assignment]
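A minimal sketch of what that plain assignment looks like, assuming the register BAR has already been mmap()ed into the process (the mapping itself is shown in the UIO slides below); REG_CTRL is a made-up offset for illustration:

```c
/* Sketch: with the BAR already mapped into user space, a register
 * update is a direct MMIO store - no syscall, no copy_from_user(). */
#include <stdint.h>

#define REG_CTRL 0x0000   /* hypothetical control register offset */

static inline void write_reg(volatile uint8_t *bar, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(bar + off) = val;   /* plain assignment */
}
```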
What is used inside DPDK?
Processor affinity (dedicated cores)
Huge pages (no swap, fewer TLB misses)
UIO (no copying from kernel)
Polling (no interrupt overhead)
Lockless synchronization (avoid waiting)
Batch packet handling
SSE, NUMA awareness
Linux default scheduling
[Diagram: threads t1-t4 are migrated freely across Core 0-Core 3 by the default scheduler]
How to isolate a core for a process
To diagnose, use top: run "top", press "f", then "j" to show the last-used CPU per process
Before boot, use isolcpus on the kernel command line: "isolcpus=2,4,6"
After boot, use cpusets: "cset shield -c 1-3", "cset shield -k on"
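Inside the application, the complementary step is pinning each worker thread to its isolated core. A minimal sketch using the standard Linux affinity call (DPDK's EAL performs equivalent pinning for its lcore threads); core 2 is just an example:

```c
/* Sketch: pin the calling thread to core 2 (one of the isolated cores). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                 /* core isolated with isolcpus/cset */

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return 1;
    }

    /* From here on, the scheduler will not migrate this thread. */
    printf("running on core %d\n", sched_getcpu());
    return 0;
}
```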
Run-to-completion model
[Diagram: Core 1 runs an RX/TX thread for Port 1 and Core 2 runs an RX/TX thread for Port 2; each thread receives, processes, and transmits its packets on the same core]
Pipeline model
[Diagram: Core 1 runs an RX thread on Port 1 and passes packets through a ring to a TX thread on Core 2, which sends them out of Port 2]
Linux paging model
[Diagram: page tables tree - cr3 -> Page Global Directory -> Page Middle Directory -> Page Table -> physical page in RAM; the TLB caches virtual-page-to-physical-page translations, and the page offset is appended to form the physical address]
TLB characteristics
$ cpuid | grep -i tlb
size: 12–4,096 entries
hit time: 0.5–1 clock cycle
miss penalty: 10–100 clock cycles
miss rate: 0.01–1%
It is a very expensive resource!
Solution - Hugepages
Benefit: optimized TLB usage, no swap
Hugepage size = 2M
Usage:
mount -t hugetlbfs nodev /mnt/huge
mmap a file from /mnt/huge
Library - libhugetlbfs
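To see why this helps: with 4 KB pages, 64 TLB entries cover only 256 KB of buffers, while with 2 MB hugepages the same 64 entries cover 128 MB. A minimal sketch of the usage above, assuming hugetlbfs is mounted at /mnt/huge and 2 MB hugepages have been reserved (the file name pktbuf is arbitrary):

```c
/* Sketch: back a buffer with one 2 MB hugepage through hugetlbfs. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (2 * 1024 * 1024)

int main(void)
{
    int fd = open("/mnt/huge/pktbuf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* The whole 2 MB region is covered by a single TLB entry
     * and is never swapped out. */
    ((char *)buf)[0] = 1;

    munmap(buf, HUGEPAGE_SIZE);
    close(fd);
    unlink("/mnt/huge/pktbuf");
    return 0;
}
```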
Lockless ring design
A writer can preempt another writer and a reader
A reader cannot preempt a writer
A reader and a writer can work simultaneously on different cores
Memory barriers
CAS operation
Bulk enqueue/dequeue
Lockless ring (Single Producer)
[Diagram, 3 steps: (1) prod_next is set one slot past prod_head while cons_head/cons_tail mark the consumer side; (2) the object is stored and prod_head is moved to prod_next; (3) prod_tail is moved up to prod_head, publishing the entry]
Lockless ring (Single Consumer)
[Diagram, 3 steps: (1) cons_next is set one slot past cons_head while prod_head/prod_tail mark the producer side; (2) the object is read and cons_head is moved to cons_next; (3) cons_tail is moved up to cons_head, freeing the slot]
Lockless ring (Multiple Producers)
[Diagram, 5 steps: (1) both producers read prod_head and compute their own prod_next1/prod_next2; (2) each producer reserves its slot by moving prod_head forward with CAS, retrying if another producer won the race; (3) each producer writes its object into its reserved slot; (4) the first producer moves prod_tail up to its reserved position; (5) once that update is visible, the second producer moves prod_tail up to prod_next2, completing the enqueue]
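The head/tail dance above can be summarized in code. The sketch below is modelled on this scheme but is not DPDK's actual rte_ring implementation: single-object enqueue, power-of-two size, C11 atomics, consumer side omitted.

```c
/* Simplified sketch of the multi-producer enqueue steps shown above. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024              /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
    _Atomic uint32_t prod_head, prod_tail;
    _Atomic uint32_t cons_head, cons_tail;
    void *slots[RING_SIZE];
};

static bool ring_mp_enqueue(struct ring *r, void *obj)
{
    uint32_t head, next;

    /* Steps 1-2: reserve a slot by moving prod_head with CAS; each
     * producer gets its own slot even if several run concurrently. */
    do {
        head = atomic_load(&r->prod_head);
        if (head - atomic_load(&r->cons_tail) >= RING_SIZE)
            return false;           /* ring full */
        next = head + 1;
    } while (!atomic_compare_exchange_weak(&r->prod_head, &head, next));

    /* Step 3: write the object into the reserved slot. */
    r->slots[head & RING_MASK] = obj;

    /* Steps 4-5: wait until earlier producers have published, then move
     * prod_tail so consumers can see the new entry (FIFO publication). */
    while (atomic_load(&r->prod_tail) != head)
        ;                            /* spin */
    atomic_store_explicit(&r->prod_tail, next, memory_order_release);
    return true;
}
```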
Kernel space network driver
[Diagram: the NIC raises interrupts and exchanges data and descriptors with the kernel-space driver via DMA; the driver feeds the in-kernel IP stack, which copies data up to the user-space App; configuration flows from the driver down to the NIC]
UIO
“The most important devices can’t be handled in user space, including, but not limited to, network interfaces and block devices.” - LDD3
UIO
[Diagram: a small kernel-space driver registers the device with the UIO framework, which exposes it through sysfs and /dev/uioX; the user-space driver mmap()s the device memory and waits for interrupts with epoll() on /dev/uioX, and the App sits on top of the user-space driver]
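A minimal sketch of the user-space side of this picture, assuming the NIC is bound to uio0: mmap() offset N * page size selects /sys/class/uio/uio0/maps/mapN, and a blocking read() on /dev/uio0 returns the interrupt count.

```c
/* Sketch: user-space UIO driver loop - map region 0, wait for interrupts. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    /* offset 0 * page_size -> maps/map0 (typically BAR0 of the device) */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    for (;;) {
        uint32_t irq_count;
        /* Blocks until the UIO kernel part signals an interrupt
         * (an epoll()/poll() on fd works the same way). */
        if (read(fd, &irq_count, sizeof(irq_count)) != sizeof(irq_count))
            break;
        printf("interrupt #%u, first register = 0x%x\n", irq_count, regs[0]);
    }
    return 0;
}
```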
Access to device from user space
[Diagram: the PCI configuration registers (Vendor Id, Device Id, Command, Status, Revision Id, ...) describe the device, and BAR0-BAR5 point to its I/O and memory regions]
Memory regions: /sys/class/uio/uioX/maps/mapX
Port I/O regions: /sys/class/uio/uioX/portio/portX
Mapping: /dev/uioX -> mmap (the offset selects the map)
PCI device information: /sys/bus/pci/devices
DMA RX
[Diagram: software prepares descriptors in the descriptor ring in host memory and updates RDT; the NIC DMAs the descriptors into its RX queue, receives a frame into the RX FIFO, DMAs the packet data into host memory, and writes the descriptor back]
DMA TX
[Diagram: software fills descriptors pointing at packet buffers in host memory and updates TDT; the NIC DMAs the descriptors into its TX queue, DMAs the packet data into the TX FIFO, transmits it, and writes the descriptors back]
Receive from SW side
[Diagram: a 6-entry descriptor ring with RDBA = 0, RDLEN = 6, RDH = 1, RDT = 5; descriptors whose DD bit is set hold completed packets, so software detaches their mbufs (mbuf1, mbuf2), attaches fresh buffer addresses, and advances RDT while the NIC advances RDH]
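In a poll-mode driver this becomes a simple loop over the ring. The sketch below mirrors the picture above for a hypothetical ixgbe-style descriptor; field names, register access, and the alloc_buf callback are simplified stand-ins, not the real driver code.

```c
/* Sketch: poll-mode RX over a descriptor ring (simplified). */
#include <stdint.h>

#define RXD_STAT_DD  0x01          /* descriptor done bit */
#define RX_RING_LEN  512

struct rx_desc {                   /* simplified descriptor layout */
    uint64_t buf_addr;             /* DMA address of the packet buffer */
    uint32_t length;
    uint32_t status;
};

struct rx_queue {
    volatile struct rx_desc *ring; /* descriptor ring in host memory */
    void    **bufs;                /* per-slot packet buffers (mbufs) */
    uint32_t  next_to_read;        /* software position (shadow of RDH) */
    volatile uint32_t *rdt_reg;    /* memory-mapped RDT register */
};

/* Returns the number of packets handed to the application. */
static uint16_t rx_poll(struct rx_queue *q, void **out, uint16_t max,
                        void *(*alloc_buf)(uint64_t *dma_addr))
{
    uint16_t n = 0;

    while (n < max) {
        uint32_t i = q->next_to_read;
        volatile struct rx_desc *d = &q->ring[i];

        if (!(d->status & RXD_STAT_DD))   /* NIC not done with this slot */
            break;

        out[n++] = q->bufs[i];            /* hand the filled mbuf up */

        uint64_t dma;                     /* refill the slot with a fresh buffer */
        q->bufs[i] = alloc_buf(&dma);
        d->buf_addr = dma;
        d->status = 0;

        q->next_to_read = (i + 1) % RX_RING_LEN;
    }

    if (n)                                /* give the refilled slots back to the NIC */
        *q->rdt_reg = (q->next_to_read + RX_RING_LEN - 1) % RX_RING_LEN;

    return n;
}
```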
Transmit from SW side
[Diagram: a 6-entry descriptor ring with TDBA = 0, TDLEN = 6, TDH = 1, TDT = 5; software writes descriptors pointing at mbufs (mbuf1, mbuf2) and advances TDT, the NIC transmits and advances TDH, and descriptors with the DD bit set can have their mbufs freed]
NUMA
[Diagram: two sockets, each with a CPU (cores, memory controller, I/O controller), its own local memory and PCI-E lanes, connected by QPI; accessing the other socket's memory or devices has to cross QPI]
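In DPDK, NUMA awareness mostly means allocating packet buffers on the same socket as the NIC that will DMA into them. A hedged sketch using the standard mempool API; the pool name and sizes are arbitrary:

```c
/* Sketch: create the mbuf pool on the NIC's own NUMA socket. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NUM_MBUFS   8191
#define MBUF_CACHE  250

static struct rte_mempool *create_pool_for_port(uint16_t port_id)
{
    /* NUMA socket of the PCI device backing this port (-1 if unknown). */
    int socket = rte_eth_dev_socket_id(port_id);

    /* Buffers end up in hugepage memory local to that socket, so the
     * NIC's DMA and the polling core avoid crossing QPI. */
    return rte_pktmbuf_pool_create("mbuf_pool", NUM_MBUFS, MBUF_CACHE,
                                   0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);
}
```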
RSS (Receive Side Scaling)
[Diagram: a hash function over incoming traffic indexes an indirection table, which spreads flows across Queue 0 ... Queue N, each served by its own CPU]
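The queue-selection step boils down to a table lookup on the low bits of the hash. A simplified sketch (real NICs compute a Toeplitz hash over the IP/port 5-tuple in hardware, which is omitted here; the 128-entry table size is a typical value, not universal):

```c
/* Sketch: RSS queue selection - low hash bits index the indirection table. */
#include <stdint.h>

#define RETA_SIZE 128                      /* typical indirection table size */

struct rss_state {
    uint16_t reta[RETA_SIZE];              /* entry i -> RX queue number */
};

static uint16_t rss_select_queue(const struct rss_state *rss, uint32_t hash)
{
    /* hash is the hardware-computed hash of src/dst addresses and ports */
    return rss->reta[hash & (RETA_SIZE - 1)];
}
```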
Flow director
[Diagram: incoming traffic is looked up in a filter table; a matching entry either drops the packet or routes it to a specific Queue 0 ... Queue N and CPU; the filter table is fed by a hash over outgoing traffic, so replies land on the core that sent the request]
Virtualization - SR-IOV
[Diagram: the NIC exposes a physical function (PF), managed by the PF driver in the VMM, and several virtual functions (VFs); each VM runs a VF driver and talks to its VF directly, while the NIC's internal virtual bridge switches traffic between the VFs and the PF]
Slow path using bifurcated driver
[Diagram: the NIC's filter table steers selected flows to a VF polled by DPDK, while the rest goes through the PF to the kernel network stack; the NIC's virtual bridge connects both paths on the same port]
Slow path using TAP
[Diagram: DPDK polls the NIC RX/TX queues in user space; packets that need the kernel's TCP/IP stack are written to a TAP device, whose ring buffers feed the in-kernel stack, and replies come back through the same TAP device]
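A minimal sketch of the TAP side, using the standard /dev/net/tun interface: the DPDK app would write() raw Ethernet frames it wants the kernel stack to handle into this fd and read() back the kernel's replies. The interface name dpdk-tap0 is arbitrary.

```c
/* Sketch: open a TAP device for the DPDK slow path. */
#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int open_tap(const char *name)
{
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) { perror("open /dev/net/tun"); return -1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;   /* raw Ethernet frames, no extra header */
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {  /* create/attach the interface */
        perror("ioctl TUNSETIFF");
        close(fd);
        return -1;
    }
    return fd;   /* write() frames toward the kernel stack, read() replies */
}

int main(void)
{
    int fd = open_tap("dpdk-tap0");
    if (fd < 0) return 1;
    /* ... hand exception-path packets to the kernel via write(fd, frame, len) ... */
    close(fd);
    return 0;
}
```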
Slow path using KNI
[Diagram: DPDK polls the NIC RX/TX queues in user space and exchanges slow-path packets with the kernel's TCP/IP stack through a KNI device backed by shared ring buffers]
Application 1 - Traffic generator
[Diagram: a user-space streams generator and traffic analyzer running on commodity x86 HW, connected to the device under test (DUT)]
Application 2 - Router
[Diagram: a user-space fast path on x86 HW forwards packets between DUT1 and DUT2 using a routing table cache, while the full routing table stays in the kernel]
Application 3 - Middlebox
[Diagram: a user-space DPI engine on x86 HW sits inline between DUT1 and DUT2]
References
Device Drivers in User Space
Userspace I/O drivers in a realtime context
The Userspace I/O HOWTO
The anatomy of a PCI/PCI Express kernel driver
From Intel® Data Plane Development Kit to Wind River Network Acceleration Platform
DPDK Design Tips (Part 1 - RSS)
Getting the Best of Both Worlds with Queue Splitting (Bifurcated Driver)
Design considerations for efficient network applications with Intel® multi-core processor-based systems on Linux
Introduction to Intel Ethernet Flow Director
My blog: Learning Network Programming