Kernel Networking Walkthrough

Thomas Graf – Principal Software Engineer
Networking Services
Red Hat
Feb 7, 2014

Agenda
● How does a packet get in and out of the net stack?
  ● NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO
● How does a packet get through the net stack?
  ● RX Handler, IP Processing, TCP Processing, TCP Fast Open
● How to account for memory and do flow control?
  ● Socket Buffers, Flow Control, TCP Small Queues
● Q&A

Touring the Network Stack
Expectation vs. Reality

How does a packet get in and out of the Network Stack?

Receive & Transmit Process

Receive: NIC → DMA → Ring Buffer → Network Stack (Kernel Space): Parse IP → Local? → Parse TCP/UDP → Socket Buffer → read() → Task in Process (User Space)
Forward: Parse IP → Local? (no) → Forward → Device? → Ring Buffer → NIC
Transmit: Task → write() → Socket Buffer → Construct TCP/UDP → Construct IP → Device? → Ring Buffer → DMA → NIC

The 3 ways into the Network Stack
● Interrupt Driven: Ring Buffer → Network Stack
● NAPI based Polling: Ring Buffer → poll() → Network Stack
● Busy Polling: the Task calls busy_poll() to pull packets from the Ring Buffer through the Network Stack

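The NAPI based polling path can be sketched as a budgeted loop. This is a minimal Python model, not kernel code: `napi_poll`, the `deque` standing in for the ring buffer, and the budget of 64 are illustrative assumptions. It shows the core idea that polling continues while the ring still holds packets and that interrupts are only re-enabled once a poll finishes under budget.

```python
from collections import deque

def napi_poll(ring, budget, deliver):
    """Process up to `budget` packets from `ring`.

    Returns (work_done, irq_reenabled). Names are hypothetical.
    """
    work_done = 0
    while work_done < budget and ring:
        deliver(ring.popleft())          # hand the packet to the network stack
        work_done += 1
    # Ring drained before the budget was used up: leave polling mode
    # and let the device raise interrupts again.
    irq_reenabled = work_done < budget
    return work_done, irq_reenabled

ring = deque(f"pkt{i}" for i in range(100))
received = []
first = napi_poll(ring, 64, received.append)    # budget exhausted, keep polling
second = napi_poll(ring, 64, received.append)   # ring drained, back to IRQ mode
```

With 100 queued packets, the first call uses its whole budget and stays in polling mode; the second drains the remainder and returns to interrupt-driven operation.
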
RSS – Receive Side Scaling
● The NIC distributes packets across multiple RX queues, allowing for parallel processing.
● Each RX queue has its own IRQ, so the IRQ affinity selects the CPU that runs the hardware interrupt handler.

Diagram: a filter in the NIC feeds RX-queue-1 → CPU 1, RX-queue-2 → CPU 3, RX-queue-3 → CPU 1, RX-queue-4 → CPU 5

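As a rough illustration of how the RSS filter keeps a flow on one queue, the sketch below hashes the flow tuple and indexes an indirection table. It rests on assumptions: `select_rx_queue` and the table layout are hypothetical, and `zlib.crc32` is only a stand-in for the hash computed by the NIC hardware (typically a Toeplitz hash). The queue to CPU mapping is the one drawn on the slide.

```python
import zlib

# Queue -> CPU mapping as drawn on the slide (set through IRQ affinity).
QUEUE_TO_CPU = {1: 1, 2: 3, 3: 1, 4: 5}

def select_rx_queue(src_ip, dst_ip, src_port, dst_port, indirection_table):
    """Pick an RX queue for a flow; every packet of a flow hashes the same way."""
    flow = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    flow_hash = zlib.crc32(flow)     # stand-in for the NIC's hardware hash
    return indirection_table[flow_hash % len(indirection_table)]

table = [1, 2, 3, 4] * 32            # hypothetical 128-entry indirection table
q_first = select_rx_queue("10.0.0.1", "10.0.0.2", 40000, 80, table)
q_again = select_rx_queue("10.0.0.1", "10.0.0.2", 40000, 80, table)
cpu = QUEUE_TO_CPU[q_first]          # CPU that runs the interrupt handler
```

Because the queue is a pure function of the flow tuple, packets of one connection stay in order on one CPU while different connections spread across queues.
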
RPS – Receive Packet Steering
● Software filter to select CPU # for processing
● Use it to ...
  ● ... distribute single queue to multiple CPUs
  ● ... redo queue - CPU mapping

Diagram: RX-queue-1 to RX-queue-4 steered onto CPU 1, CPU 2 and CPU 3

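RPS is configured per RX queue by writing a hexadecimal CPU bitmask to the `rps_cpus` file in sysfs. The helper below is a small sketch for computing that mask; the function names and the device name `eth0` are illustrative, and it assumes a machine with at most 32 CPUs (larger masks are written as comma-separated 32-bit groups, which this sketch does not handle). Note that kernel CPU numbers start at 0.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmask selecting the given (zero-based) CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def rps_cpus_path(dev, rx_queue):
    """sysfs file that holds the RPS CPU mask of one RX queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"

mask = rps_cpu_mask([1, 2, 3])       # bits 1, 2 and 3 set -> "e"
path = rps_cpus_path("eth0", 0)
# Applying it requires root, for example:
#   with open(path, "w") as f: f.write(mask)
```

Writing a mask of 0 to the same file disables RPS for that queue again.
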
Hardware Offload
● RX/TX Checksumming
  ● Perform CPU intensive checksumming in hardware.
● Virtual LAN filtering and tag stripping
  ● Strip 802.1Q header and store VLAN ID in network packet meta data.
  ● Filter out unsubscribed VLANs.

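To make concrete what checksum offload saves the CPU from doing, here is a plain Python version of the 16-bit one's complement Internet checksum (RFC 1071) used by IP, TCP and UDP. It is a reference sketch for illustration only, not the kernel's optimized implementation; the input bytes are the worked example from RFC 1071.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
# A receiver verifies by summing data plus checksum; the result must be 0.
verified = internet_checksum(sample + checksum.to_bytes(2, "big")) == 0
```

The loop touches every byte of every packet, which is why pushing it into the NIC pays off at high packet rates.
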
Generic Receive Offload

NAPI based GRO: Ring Buffer (MTU-sized packets) → poll() → GRO → Network Stack (up to 64K)

Segmentation Offload

● Generic Segmentation Offload (GSO): Network Stack (up to 64K) → GSO → MTU-sized packets → Ring Buffer
● TCP Segmentation Offload (TSO): Network Stack (up to 64K) → Ring Buffer → TSO in the NIC → MTU-sized packets

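The size relationship behind GRO and GSO/TSO can be sketched in a few lines. This is a simplified model under stated assumptions: it ignores headers entirely, `segment` and `coalesce` are hypothetical names, and the MSS of 1448 bytes assumes a 1500-byte MTU with 20-byte IP and TCP headers plus the 12-byte TCP timestamp option.

```python
MSS = 1448   # assumed payload per MTU-sized packet (see lead-in)

def segment(payload, mss):
    """GSO/TSO direction: cut one large buffer into MSS-sized segments."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

def coalesce(segments):
    """GRO direction: merge consecutive segments of a flow into one buffer."""
    return b"".join(segments)

big = bytes(64 * 1024)               # one "up to 64K" buffer from the stack
segments = segment(big, MSS)
count = len(segments)                # number of packets on the wire
restored = coalesce(segments)
```

The stack handles one large buffer instead of dozens of small packets, so the per-packet cost is paid once per 64K rather than once per MTU.
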
How does a packet get through the Network Stack?

(c) Karen Sagovac

Packet Processing

Link Layer
→ Packet Socket (ETH_P_ALL): tcpdump
→ Ingress QoS
→ RX Handler: Bridge, Open vSwitch, Team, Bonding, macvlan, macvtap
→ Proto Handler: IPv4, IPv6, ARP, IPX, ...
→ Drop

Feast of the hungry chicks

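The final Proto Handler step is essentially a lookup on the frame's EtherType. The sketch below models only that step for untagged Ethernet frames; the EtherType constants match the values in `linux/if_ether.h`, while `proto_handler` and the handler names are illustrative assumptions.

```python
import struct

# EtherType values as defined in linux/if_ether.h
ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806
ETH_P_IPX = 0x8137
ETH_P_IPV6 = 0x86DD

PROTO_HANDLERS = {
    ETH_P_IP: "IPv4",
    ETH_P_ARP: "ARP",
    ETH_P_IPX: "IPX",
    ETH_P_IPV6: "IPv6",
}

def proto_handler(frame: bytes) -> str:
    """Return the handler name for an untagged Ethernet frame, or "Drop"."""
    # Bytes 0-11 are destination and source MAC; bytes 12-13 are the EtherType.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return PROTO_HANDLERS.get(ethertype, "Drop")

ipv6_frame = bytes(12) + struct.pack("!H", ETH_P_IPV6) + b"payload"
odd_frame = bytes(12) + struct.pack("!H", 0x1234) + b"payload"
```

A frame whose EtherType has no registered handler falls through to Drop, as on the slide.
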
IP Processing

Receive: Link Layer → IP Handler → PREROUTING → Route Lookup
Local Delivery: Route Lookup → INPUT → L4 (TCP, ...)
Forwarding: Route Lookup → FORWARD → POSTROUTING → Link Layer
Local Output: User Space → Route Lookup → IPv4 Construction → OUTPUT → POSTROUTING → Link Layer

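The hook names on this slide are the Netfilter hooks, and which ones a packet passes depends on where it comes from and where the route lookup sends it. A minimal sketch of that decision, with `hooks_traversed` and its parameters as hypothetical names:

```python
def hooks_traversed(origin, destination_is_local=False):
    """Return the Netfilter hooks a packet passes, in order.

    origin: "network" for a received packet, "local" for one sent
    by a local process.
    """
    if origin == "local":
        return ["OUTPUT", "POSTROUTING"]
    if destination_is_local:
        return ["PREROUTING", "INPUT"]
    return ["PREROUTING", "FORWARD", "POSTROUTING"]

delivered = hooks_traversed("network", destination_is_local=True)
forwarded = hooks_traversed("network")
sent = hooks_traversed("local")
```

Only forwarded packets see FORWARD; INPUT and OUTPUT are reserved for traffic that ends or starts at a local socket.
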
TCP Processing

IP → Parse TCP → Lookup Socket → Socket Filter, then:
● socket locked: Backlog
● task exists: Prequeue
● otherwise: Receive TCP
Backlog and Prequeue hand the work over to the task (process context ← softirq).
Receive TCP → Receive Socket Buffer → read() / poll() → Task

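The three-way decision on this slide can be written down as a small function. This is a model of the diagram only, not the kernel code path; `tcp_receive_queue` and its flags are hypothetical names.

```python
def tcp_receive_queue(socket_locked, task_waiting):
    """Decide where softirq puts an incoming TCP segment, per the slide."""
    if socket_locked:
        # A task holds the socket lock: defer, the lock holder
        # processes the backlog when it releases the lock.
        return "backlog"
    if task_waiting:
        # A task is blocked in read(): queue the segment so TCP
        # processing runs in that task's process context.
        return "prequeue"
    # Nobody is using the socket: process the segment right away in softirq.
    return "receive"

decisions = {
    (locked, waiting): tcp_receive_queue(locked, waiting)
    for locked in (True, False)
    for waiting in (True, False)
}
```

The lock check comes first, so a locked socket always takes the backlog path even if a task is also waiting.
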
TCP Fast Open
(net.ipv4.tcp_fastopen)

Regular, 1st and 2nd Req (2x RTT each):
  Client → Server: SYN
  Server → Client: SYN+ACK
  Client → Server: ACK+HTTP GET
  Server → Client: Data

Fast Open, 1st Req (2x RTT):
  Client → Server: SYN
  Server → Client: SYN+ACK+Cookie
  Client → Server: ACK+HTTP GET
  Server → Client: Data

Fast Open, 2nd Req (1x RTT):
  Client → Server: SYN+Cookie+HTTP GET
  Server → Client: SYN+ACK+Data

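The saving shown on this slide is plain arithmetic on round trips. The sketch below counts round trips until the first response data arrives, following the slide's sequence diagrams; the function name and the 50 ms round-trip time are illustrative assumptions.

```python
def rtts_to_first_data(fast_open, has_cookie):
    """Round trips until the client receives the first response data."""
    if fast_open and has_cookie:
        return 1     # the request rides on the SYN, data on the SYN+ACK
    return 2         # handshake first, request and response afterwards

RTT_MS = 50          # assumed round-trip time
regular_ms = rtts_to_first_data(False, False) * RTT_MS
first_tfo_ms = rtts_to_first_data(True, False) * RTT_MS    # cookie not yet known
repeat_tfo_ms = rtts_to_first_data(True, True) * RTT_MS
```

Only repeat connections benefit: the first Fast Open connection merely obtains the cookie and costs the same as a regular one.
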
Memory Accounting & Flow Control

Socket Buffers & Flow Control
(net.ipv4.tcp_{r|w}mem)

Sender (ssh): write() → wmem overlimit? → yes: Block or EWOULDBLOCK
                                         → no: Socket Buffer (wmem += packet-size) → TCP/IP → TX Ring Buffer (wmem -= packet-size)
Receiver (ssh): RX Ring Buffer → TCP/IP → rmem overlimit? → yes: Reduce TCP Window
                                         → Socket Buffer (rmem += packet-size) → data read by the task (rmem -= packet-size)

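The send-side accounting on this slide can be modelled with a counter and a limit. This is a simplified sketch, not the kernel's exact check: `SendBuffer`, `try_write` and the 4096-byte limit are hypothetical, and a non-blocking socket is assumed so that an over-limit write fails with EWOULDBLOCK instead of blocking.

```python
import errno

class SendBuffer:
    """Toy model of wmem accounting for one non-blocking socket."""

    def __init__(self, wmem_limit):
        self.wmem_limit = wmem_limit
        self.wmem = 0

    def write(self, size):
        if self.wmem + size > self.wmem_limit:          # wmem overlimit?
            raise BlockingIOError(errno.EWOULDBLOCK, "send buffer full")
        self.wmem += size                               # wmem += packet-size

    def tx_complete(self, size):
        self.wmem -= size                               # wmem -= packet-size

def try_write(buf, size):
    """Return True if the write was accepted, False on EWOULDBLOCK."""
    try:
        buf.write(size)
        return True
    except BlockingIOError:
        return False

buf = SendBuffer(wmem_limit=4096)
accepted = try_write(buf, 3000)      # fits
rejected = try_write(buf, 2000)      # would exceed the limit
buf.tx_complete(3000)                # packet left the TX ring buffer
retried = try_write(buf, 2000)       # space is available again
```

The receive side works the same way with rmem, except that running over the limit shrinks the advertised TCP window rather than failing a system call.
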
TCP Small Queues
(net.ipv4.tcp_limit_output_bytes)

ssh → write() → Socket Buffer
torrent → write() → Socket Buffer
Both sockets → TCP/IP → Queuing Discipline → Driver → TX Ring Buffer
TSQ: max 128 KB in flight per socket below TCP/IP

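The effect of the per-socket TSQ limit can be illustrated with a toy model: a bulk sender cannot fill the queues below TCP, so an interactive flow's packets do not wait behind megabytes of someone else's data. The `Flow` class, the 1024-byte packet size and the pending amounts are assumptions for illustration, and the limit check is simplified compared to the kernel.

```python
TSQ_LIMIT = 128 * 1024   # bytes per socket, the figure quoted on the slide

class Flow:
    def __init__(self, name, pending):
        self.name = name
        self.pending = pending       # bytes waiting in the socket buffer
        self.in_flight = 0           # bytes sitting in qdisc and driver queues

    def push(self, packet_size=1024):
        """Move packets below the TCP layer until the TSQ limit is reached."""
        while self.pending > 0 and self.in_flight + packet_size <= TSQ_LIMIT:
            size = min(packet_size, self.pending)
            self.pending -= size
            self.in_flight += size

    def tx_complete(self, size):
        """The driver freed `size` bytes, so the flow may push more."""
        self.in_flight -= size
        self.push()

torrent = Flow("torrent", pending=1024 * 1024)
ssh = Flow("ssh", pending=2048)
torrent.push()                       # stops at the limit, rest stays in the socket
ssh.push()                           # small flow gets everything out at once
```

The remaining torrent data stays in its own socket buffer, where it is charged to that socket's wmem, and is released only as transmit completions come back.
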
Q&A

Feedback Page
● http://devconf.cz/f/1

Coming Up Next:
NetworkManager for Enterprise
Dan Williams

DevConf 2014 Kernel Networking Walkthrough
