Stephen Hemminger discusses performance challenges in software networking. He addresses myths about throughput limits and compares hardware and software approaches. Optimization requires analyzing bottlenecks such as CPU cache usage and lock contention. Benchmarks run under ideal conditions, while real systems face bursty traffic, many rules, and limited resources. The performance of software networking depends on algorithms, CPU efficiency, and cache behavior.
8. Benchmark vs Reality
● Benchmark
– random flows
– 10 or fewer rules
– 128GB memory
– 32 or more CPUs
● Reality
– Bursty flows
– 1000s of rules
– 2GB VM
– 2-4 CPUs
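To make the rule-count gap concrete, here is a minimal sketch that counts comparisons for a packet that matches no rule. It assumes a linear first-match classifier, which the slide does not specify; `first_match`, `make_rules`, and the port-based rule format are hypothetical names chosen for illustration only.

```python
def first_match(rules, packet):
    """Scan rules in order; return (action, number of comparisons made)."""
    comparisons = 0
    for dst_port, action in rules:
        comparisons += 1
        if dst_port == packet["dst_port"]:
            return action, comparisons
    # No rule matched: fall through to a default action.
    return "drop", comparisons


def make_rules(n):
    """Hypothetical rule set: rule i accepts destination port i."""
    return [(port, "accept") for port in range(n)]


# Worst case for a linear scan: the packet matches no rule at all.
packet = {"dst_port": 999_999}

_, benchmark_cost = first_match(make_rules(10), packet)
_, reality_cost = first_match(make_rules(5000), packet)

if __name__ == "__main__":
    print("comparisons with 10 rules:  ", benchmark_cost)
    print("comparisons with 5000 rules:", reality_cost)
```

Under this assumption the per-packet work grows in proportion to the rule count, so a result measured with 10 rules says little about a deployment with thousands. A classifier with a different lookup structure would scale differently.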
9. System effects
● Data/Control resource sharing
– CPU cache
– Background noise
● Power consumption
● Memory footprint
● Virtualization overhead
● Platform differences
10. Basics
Memory is ~70+ ns away (at 2.0 GHz, that is 140+ cycles)
Source: Intel® 64 and IA-32 Architectures: Optimization Reference Manual
Metric                            Sandy Bridge / Ivy Bridge   Haswell                 Skylake
L1 data access (cycles)           4                           4                       4
L1 peak bandwidth (bytes/cycle)   2x16                        2x32 load, 1x32 store   2x32 load, 1x32 store
L2 data access (cycles)           12                          11                      12
L2 peak bandwidth                 1x32                        64                      64
Shared L3 access (cycles)         26-31                       34                      44
L3 peak bandwidth                 32                          -                       32
Data hit in L2 cache: 43 (clean hit), 60 (modified)
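The conversion between nanoseconds and cycles on this slide can be checked with a short sketch. It applies the slide's 2.0 GHz clock to the Skylake L2 and L3 access latencies listed above. That pairing is an assumption made for comparison, since the cycle counts are not tied to a particular clock rate in the source. The helper names are hypothetical.

```python
CPU_GHZ = 2.0  # clock rate used on the slide


def ns_to_cycles(ns, ghz=CPU_GHZ):
    """Convert a latency in nanoseconds to CPU cycles (GHz = cycles per ns)."""
    return ns * ghz


def cycles_to_ns(cycles, ghz=CPU_GHZ):
    """Convert a latency in CPU cycles to nanoseconds."""
    return cycles / ghz


# Skylake access latencies in cycles, taken from the slide.
SKYLAKE_CYCLES = {"L2": 12, "L3": 44}

memory_cycles = ns_to_cycles(70)  # the slide's ~70 ns main-memory latency
latency_ns = {level: cycles_to_ns(c) for level, c in SKYLAKE_CYCLES.items()}

if __name__ == "__main__":
    print(f"memory: 70 ns = {memory_cycles:.0f} cycles")
    for level, ns in latency_ns.items():
        ratio = memory_cycles / SKYLAKE_CYCLES[level]
        print(f"{level}: {ns:.0f} ns, memory is about {ratio:.1f}x slower")
```

Read this way, one trip to main memory costs roughly as many cycles as eleven or twelve L2 hits, which is why the talk treats cache behavior as a first-order concern.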
13. Fast vs Slow
● New software
– Lockless
– Single function
– Tight layering
– Cache aware
● Legacy software
– Interrupts
– Shared resources
– System calls
– VM exit
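One common way to avoid locks and shared resources is to give each core its own state and combine it only when it is read. The sketch below shows that structure for packet counters. `PerCoreCounters` is a hypothetical name and not something from the talk. Python cannot show the cache effects, so this illustrates the data layout only. In C, each slot would typically be padded to a cache line to avoid false sharing.

```python
class PerCoreCounters:
    """Packet counters sharded by core: each core writes only its own slot."""

    def __init__(self, n_cores):
        self.counts = [0] * n_cores

    def add(self, core_id, n=1):
        # The fast path touches one slot owned by the calling core,
        # so no lock is needed to protect it from other writers.
        self.counts[core_id] += n

    def total(self):
        # The slow path (e.g. a statistics query) pays the cost of
        # reading every slot; an approximate sum is usually acceptable.
        return sum(self.counts)


counters = PerCoreCounters(n_cores=4)
counters.add(0, 3)  # core 0 processed three packets
counters.add(2)     # core 2 processed one packet

if __name__ == "__main__":
    print("per-core:", counters.counts)
    print("total:   ", counters.total())
```

The design moves the cost from the per-packet path, which runs millions of times per second, to the rarely used read path.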