Performance challenges in software networking

1. Performance Challenges In Software Networking Stephen Hemminger @networkplumber

2. Who am I? ● Principal Architect Brocade vRouter (Vyatta) ● Fellow Linux Foundation ● Sequent Unix SMP networking ● DPDK – #3 contributor ● Linux – 10+ year contributor – Maintainer ● Bridge ● iproute

3. Agenda ● Myths ● Requirements ● Benchmarks ● Reality

4. Myths ● Software networking can never do: – 1Gbit ● 2008 – Linux, FreeBSD, ... – 10Gbit ● 2013 – DPDK, Netmap, ... – 40Gbit ● 2015 – DPDK, ... – 100Gbit ● 2016?

5. Hardware vs Software ● Hardware – Clock rate – TCAM size – TCAM miss – Bus transactions ● Software – Clock rate – Cache size – Cache misses per packet – PCI bus operations

6. Optimization cycle ● Analyze ● Optimize ● Measure

7. SDN Measurement ● Forwarding – RFC2544 ● Scaling – Imix, BGP, Firewall, ... ● Application – BGP convergence ● Availability (Figure: SDN Workload Performance Test Environment)

8. Benchmark vs Reality ● Benchmark – Random flows – 10 or fewer rules – 128GB memory – 32 or more CPUs ● Reality – Bursty flows – 1000s of rules – 2GB VM – 2-4 CPUs

9. System effects ● Data/Control resource sharing – CPU cache – Background noise ● Power consumption ● Memory footprint ● Virtualization overhead ● Platform differences

10. Basics: memory is ~70+ ns away (i.e. 2.0 GHz = 140+ cycles). Source: Intel® 64 and IA-32 Architectures Optimization Reference Manual
Metric                            Sandy Bridge / Ivy Bridge   Haswell                 Skylake
L1 data access (cycles)           4                           4                       4
L1 peak bandwidth (bytes/cycle)   2x16                        2x32 load, 1x32 store   2x32 load, 1x32 store
L2 data access (cycles)           12                          11                      12
L2 peak bandwidth (bytes/cycle)   1x32                        64                      64
Shared L3 access (cycles)         26-31                       34                      44
L3 peak bandwidth (bytes/cycle)   32                          -                       32
Data hit in L2 cache: 43 – clean hit, 60 – modified

11. Time Budget ● 10Gbit 64 byte packet – 67.2 ns = 201 cycles @ 3 GHz ● Cache – L3 = 8 ns – L2 = 4.3 ns ● Atomic operations – Lock = 8.25 ns – Lock/Unlock = 16.1 ns Source: Network stack challenges at increasing speeds – LCA 2015, Jesper Dangaard Brouer
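
The 67.2 ns figure follows from the Ethernet wire format: a minimum 64-byte frame occupies 84 bytes on the wire once the 8-byte preamble and 12-byte inter-frame gap are counted. A minimal Python sketch of that arithmetic, extended to the other link speeds from the Myths slide (the function and constant names are illustrative, not from the talk):

```python
# Per-packet time budget for minimum-size Ethernet frames.
# Assumption: 20 bytes of wire overhead per frame
# (8-byte preamble/SFD + 12-byte inter-frame gap).
WIRE_OVERHEAD_BYTES = 20


def packet_time_ns(link_gbps, frame_bytes=64):
    """Time one frame occupies the wire, in nanoseconds."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    # bits / (Gbit/s) gives nanoseconds directly.
    return bits_on_wire / link_gbps


def cycle_budget(link_gbps, cpu_ghz, frame_bytes=64):
    """CPU cycles available per packet on a single core."""
    return packet_time_ns(link_gbps, frame_bytes) * cpu_ghz


if __name__ == "__main__":
    for gbps in (1, 10, 40, 100):
        ns = packet_time_ns(gbps)
        cycles = cycle_budget(gbps, 3.0)
        print(f"{gbps:>3} Gbit: {ns:7.2f} ns/packet, {cycles:7.1f} cycles @ 3 GHz")
```

At 10 Gbit this reproduces the slide's 67.2 ns and roughly 201 cycles; the budget shrinks in proportion to link speed when a single core handles every packet.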

12. Magic Elixir?

13. Fast vs Slow ● New software – Lockless – Single function – Tight layering – Cache aware ● Legacy software – Interrupts – Shared resources – System calls – VM exit

14. Performance Tradeoffs ● Bulk operations ➔ Latency ● Lock-less algorithms ➔ Update speed, consistency ● Tight integration ➔ Inflexible ● Polling ➔ CPU utilization, power management ● Caching ➔ Memory utilization, update overhead

15. CPU pipeline

16. Cache flow ● Rx Device ➔ Network Function ➔ Tx Device ● Cache accesses – Rx Poll – Rx Descriptor – Packet Data – Function Table Accesses – Tx Descriptor – Tx Kick ● Worst case 7+ cache misses per packet!
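
To see why 7 cache misses per packet matter, the stall time can be compared with the 67.2 ns budget from the Time Budget slide. The sketch below uses the latencies quoted on the earlier slides (L2 = 4.3 ns, L3 = 8 ns, memory ~70 ns) and assumes, pessimistically, that the misses are fully serialized with no prefetching or overlap; the names are illustrative.

```python
# Stall time for a fixed number of cache misses per packet, compared with
# the 10 Gbit / 64-byte budget. Latencies are the figures quoted in the
# slides; treating misses as strictly serialized is a simplifying assumption.
BUDGET_NS = 67.2
LATENCY_NS = {"L2": 4.3, "L3": 8.0, "DRAM": 70.0}
MISSES_PER_PACKET = 7


def stall_ns(level, misses=MISSES_PER_PACKET):
    """Total stall if every miss is served from the given level."""
    return LATENCY_NS[level] * misses


def budget_fraction(level, misses=MISSES_PER_PACKET):
    """Stall time as a fraction of the per-packet budget."""
    return stall_ns(level, misses) / BUDGET_NS


if __name__ == "__main__":
    for level in LATENCY_NS:
        print(f"{level:>4}: {stall_ns(level):6.1f} ns "
              f"({budget_fraction(level):.0%} of budget)")
```

Under these assumptions, seven misses served from L3 already consume most of the budget, and seven trips to memory exceed it several times over.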

17. Cache Ping/Pong ● Cache line shared between cores – Statistics – Session state
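
A common remedy for this kind of sharing is to give each core its own counter, padded out to a full cache line. The sketch below only models the memory layout: it counts how many cache lines end up written by more than one core for a packed versus a padded array of per-core counters. The 64-byte line size and 8-byte counter size are assumptions (typical for x86), and the helper name is hypothetical.

```python
# Model of per-core counter layout: which cache lines are written by more
# than one core? Assumptions: 64-byte cache lines, 8-byte counters.
CACHE_LINE_BYTES = 64
COUNTER_BYTES = 8


def shared_lines(num_cores, stride_bytes):
    """Count cache lines touched by more than one core.

    Core i owns the counter stored at byte offset i * stride_bytes.
    """
    owners = {}  # cache line index -> set of cores writing to it
    for core in range(num_cores):
        start = core * stride_bytes
        for offset in range(start, start + COUNTER_BYTES):
            owners.setdefault(offset // CACHE_LINE_BYTES, set()).add(core)
    return sum(1 for cores in owners.values() if len(cores) > 1)


if __name__ == "__main__":
    print("packed counters:", shared_lines(4, COUNTER_BYTES), "shared line(s)")
    print("padded counters:", shared_lines(4, CACHE_LINE_BYTES), "shared line(s)")
```

Padding trades memory for isolation: each counter occupies a whole line, but no line bounces between cores when statistics are updated.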

18. NFV bucket brigade

19. Packet batching
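
Batching helps because fixed per-call costs are paid once per batch rather than once per packet. A rough cost model, using the 16.1 ns lock/unlock figure from the Time Budget slide as the fixed cost and an assumed, purely illustrative 20 ns of per-packet work:

```python
# Amortized per-packet cost when a fixed overhead is paid once per batch.
# fixed_ns defaults to the lock/unlock cost quoted in the slides;
# work_ns is an illustrative assumption, not a measured value.
def per_packet_ns(batch_size, fixed_ns=16.1, work_ns=20.0):
    """Average cost per packet for a given batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_ns / batch_size + work_ns


if __name__ == "__main__":
    for batch in (1, 4, 16, 32, 64):
        print(f"batch {batch:>2}: {per_packet_ns(batch):5.2f} ns/packet")
```

The gain flattens quickly as the batch grows, while the first packet in a batch waits longer before it is processed, which is the latency cost listed under Performance Tradeoffs.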

20. New developments ● DPDK – Multi-architecture – NIC support – Packet pipeline – ACL – LPM – ... ● Linux – Batched Tx – Lockless queue disciplines – Memory allocator performance

21. Conclusions ● Software networking is a function of: – Algorithms – Low level CPU utilization – Cache behavior

22. Questions?

23. Thank you Stephen Hemminger stephen@networkplumber.org @networkplumber

24. Next Generation Software Networking ● Open vSwitch + DPDK ● Brocade – vRouter ● 6WIND ● FD.io – VPP ● Juniper – OpenContrail ● Huawei – FusionSphere
