DAY ONE: CONTRAIL DPDK vROUTER
Let’s do SDN! By using DPDK as an open source data plane for the Contrail/Tungsten Fabric vRouter,
you’ll learn about DPDK and its related technologies (huge pages, NUMA, CPU pinning, etc.) as well
as the vRouter DPDK design and details on installation.
“This book is a must read for network software developers. It covers in great detail how Network Applications
can be accelerated with Data Plane Development Kit (DPDK). In particular, it describes libraries, tools and
techniques to optimize Network Function Virtualization (NFV) and Software Defined Network (SDN) data
plane performance by more than a factor of ten. It is an excellent showcase of a close knit collaboration
between Intel and Juniper engineers over many years to deliver high performance and cloud scale applica-
tions for the networking industry. I am impressed with its thoroughness and wealth of practical, hands-on
information. In summary, this book rocks.”
Rajesh Gadiyar, Vice President and CTO, Network Platforms Group, Intel Corporation
“This is a superb book. Four Juniper engineers have combined their experience in working with DPDK and
its use as an open source data plane for SDN (vRouter). Step-by-step the authors describe the innards of
vRouter and show you how to configure, optimize, and troubleshoot one of the best SDN solutions in the
marketplace. Congratulations to the authors, and to you the reader who are about to be impressed.”
Raj Yavatkar, CTO, Juniper Networks
IT’S DAY ONE AND YOU HAVE A JOB TO DO, SO LEARN HOW TO:
n Understand SDN basics.
n Apply DPDK and network virtualization technologies.
n Identify Contrail vRouter DPDK internal architectures.
n Manage packet forwarding flows in DPDK vRouter.
n Install Contrail and the traffic testing tools.
n Be familiar with the utilities available for DPDK vRouter to troubleshoot and analyze performance.
Inside the software-defined, high-performance, feature-rich, open source Tungsten Fabric virtual router.
By Kiran KN, Ping Song, Przemyslaw Grygiel, Laurent Durand
ISBN 978-1936779895
Juniper Networks Books are focused on network reliability and efficiency. Peruse the complete library at www.juniper.net/books.
Day One: Contrail DPDK vRouter
© 2021 by Juniper Networks, Inc. All rights reserved. Juniper Networks and Junos are registered trademarks of Juniper Networks, Inc. in the United States and other countries. The Juniper Networks Logo and the Junos logo are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks are the property of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.

Published by Juniper Networks Books
Authors: Kiran K N, Ping Song, Przemyslaw Grygiel, Laurent Durand
Technical Reviewers: Vincent Zhang, Richard Roberts, T. Sridhar
Editor in Chief: Patrick Ames
Copyeditor: Nancy Koerbel
Printed in the USA by Vervante Corporation.
Version History: v1, January 2021
2 3 4 5 6 7 8 9 10
Comments, errata: dayone@juniper.net

About the Authors

Kiran K N is a Principal Engineer in Juniper Networks, with more than 15 years of experience in the networking industry. He graduated from the Indian Institute of Technology with a Masters degree in Computer Science. His current area of interest is Software Defined Networks and datapath technology. He is an expert in DPDK and an active developer of Contrail vRouter. He has made significant contributions towards the architecture, hardening, features, and performance enhancements of vRouter.

Ping Song is a technical support engineer at Juniper Networks. As a network engineer, he currently supports customers building and maintaining their data centers with the Juniper Contrail networking SDN solution. Ping is also an enthusiastic Linux and Vim power user. After work, Ping enjoys gardening and reading Chinese literature. Ping holds active double CCIE#26084 (R&S, SP) and triple JNCIE (SP#2178, ENT#775, DC#239) certifications.

Przemyslaw Grygiel is a Principal Engineer in Juniper Networks with 18 years of experience in the cloud and networking industry. He is an expert in cloud computing and SDNs and has seven years of experience with Juniper Contrail. Przemyslaw holds CCIE #15278 (R&S).

Laurent Durand is a technical consultant in Juniper Networks. He started as a C/C++ developer 25 years ago. In early 2000, he worked as a Network and System Engineer. Later, while working as a Network Architect, he designed country-wide IP MPLS networks (Mobile and Fixed) and VoIP solutions for some European Telcos. For the last few years he has worked as a Cloud Solutions Architect; he has also been working on SDN infrastructures and teaches Network Virtualization in some Paris engineering schools.

Authors’ Acknowledgments

We’d all like to thank Patrick Ames for his encouragement and support during the time of writing this book. And thank you to Nancy Koerbel for the expert editing and proofing.

Kiran: Writing the book would not have been possible without the help and support from my family members and my managers at Juniper. I would like to thank my parents, Mr. Prasad K N and Mrs. Gowri Prasad K N, for their constant support. I would also like to thank Juniper CTO Raj Yavatkar and Juniper VPs Rakesh Manocha and T. Sridhar for their help and guidance in writing this book. Last but not least, I want to thank my teammates Ping, Przemyslaw, and Laurent for a great and fun-filled collaboration.

Ping: This book was written during the most special year, 2020. Needless to say, it has been tough and full of uncertainties for everyone, but I am positive we will get through this soon. I would like to thank Laurent, Kiran, and Przemysław, my partners in this book, for their deep knowledge and helpful technical discussions during the past few months. Thanks to my manager Siew Ng, for being supportive of the Contrail book project, and for allowing me to focus more on the book during the last few weeks. In that regard, I’d like to also thank my CFTS SDN teammates, who offloaded parts of my routine work during the book writing process. Lastly, I’d like to thank my wife, Sandy, for her support on my work during the pandemic, and my lovely kids Xixi and Jeremy for all the joy they brought. Thank you all!

Przemysław: I would like to thank my family and my manager for their support during the writing of this book.

Laurent: I’d like to thank all my teammates for their support on Contrail DPDK and for their deep understanding and troubleshooting.
Terminology
Throughout this book, the authors use the terms Contrail, OpenContrail, Tungsten
Fabric, and TF interchangeably.
This book assumes that you have some basic knowledge about SDN and
Contrail architecture.
You will need two or more Intel Xeon servers with a Linux OS and DPDK-compatible NICs (for example, Intel X710 or Intel 82599).
You will know how to install Contrail and the traffic testing tools.
You will be familiar with the utilities available for DPDK vRouter to troubleshoot and analyze DPDK performance.
NOTE This book replaces the term “slave” with the term “client.”
Chapter 1
SDN Review
This progress was possible thanks to the use of proprietary TCAM (Ternary Con-
tent-Addressable Memory) and ASICs (Application-Specific Integrated Circuits),
which are designed to perform table look up and data packet forwarding at ex-
tremely high speeds.
In the early 2000s, the virtualization of x86 systems led to several innovations in the systems domain, and evolution in compute virtualization and high-speed network devices has enabled network cloud creation.
Since it isn’t convenient to manage several isolated network devices, each of which
may have its own configuration language, the following needs have emerged:
Single point of configuration
Configuration protocol standardization
Good performance
All of which calls for more cloud and SDN technology development.
The history of SDN development is not straightforward and is more nuanced than
a single storyline suggests. It’s far more complex than can be described in this
short section. Figure 1.2 from [sdn-history] shows developments in programma-
ble networking over the past 20 years, and their chronological relationship to ad-
vances in network virtualization.
So What is SDN?
Both the concept of SDN, and the term itself, are broad and often confusing.
There is no truly accurate definition of SDN, and vendors usually explain it very
differently. Initially it was used to describe Stanford’s OpenFlow project, but to-
day that definition has been extended to include a much wider swath of technolo-
gies. Discussion about each vendor’s exact SDN definition is beyond the scope of
this book, but in general, an SDN solution provides anywhere from one to several
of the following characteristics:
A network control and configuration plane split from the network data plane
You can see in Figure 1.3 that SDN allows simple high-level policies in the applica-
tion layer to modify the network, because the device level dependency is eliminat-
ed to some extent. The network administrator can operate the different
vendor-specific devices in the infrastructure layer from a single software console
– the control layer. The controller in the control layer is designed in such a way
that it can globally view the whole network. This controller design helps to intro-
duce functionalities or programs, since the applications just need to talk to the
centralized controller without needing to know all the details communicating with
each individual device. These details are hidden by the controller from the
applications.
Several traits fit this new model:
Openness: Communication between controller and network device uses
standardized protocols like REST, OpenFlow, XMPP, NetConf, gRPC, etc.
This eliminates traditional vendor lock-in, giving you freedom of choice in
networking.
Cost reduction: Due to the open model, users can pick any low-cost vendor
for their infrastructure (hardware).
Automation: The controller layer has a global view of the whole network. With
the APIs exposed by the control layer, automation of applications becomes
much easier.
In Figure 1.3, OpenFlow is labeled as the protocol between the control and infrastructure layers. This is just one example of the use of a standard communication protocol. Today more communication protocols are available and have become standard in the SDN industry; they are covered later in this chapter.
NOTE In practice, routing and forwarding databases in a router are much more
complicated. For example, whenever MPLS is involved there will also be a Label
Information Base (LIB) and a label forwarding information base (LFIB), but we
won’t cover these details in this book.
For example, a simplified Juniper MX Series control plane typically looks like the
one illustrated in Figure 1.7.
Primary Changes Between SDN and Traditional Networking
Running a control plane on each router is very hard to manage, because each indi-
vidual network device needs to be carefully configured. Extensive, vendor-specific
experiences and skills are required to configure each device. The high number of
configuration points can make it challenging to build a robust network. Flexibility
is also a recurring hurdle for traditional networks since most routers run propri-
etary hardware and software. In contrast, in SDN, networking control and configuration functions are gathered into an SDN controller, which controls network
devices. The new architecture is intended to provide a completely new way of con-
figuring the network. This new cloud infrastructure brings:
simplified routers, without complex control planes in each router.
Figure 1.8 Comparison Between Traditional Network Devices and SDN Devices
You can see that the SDN infrastructure uses a centralized configuration and con-
trol point. Route calculation is done centrally in the controller and distributed into
each SDN network node. While the idea looks simple, two fundamental protocols
and infrastructures must be implemented before the model can work:
A southbound network protocol: This is needed to allow routing information to be exchanged between the SDN controller and each controlled element.
An underlay network: This is a network infrastructure that allows communication between the SDN controller and SDN network nodes, as well as communication between the SDN nodes themselves.
The underlay network infrastructure plays the same role as the local switch fabric
inside a standalone router between the control processor card and line cards. An
overlay network based on it can be built by the controller, which basically hides
underlay network infrastructure details from the applications so they focus on the
high level service implementations. We’ll discuss underlay and overlay more in the
next section.
This model also makes the controller the weakest point. Think of what will hap-
pen if this SDN controller, serving as the brain, stops working. Everything freezes
and nothing works as expected, or even worse, some part of the infrastructure
continues to run but in an unexpected way, which will very likely trigger bigger
issues in other parts of the network.
Each SDN solution supplier has put forth a lot of effort to solve this weakness. Us-
ing a clustered architecture to build a highly resilient controller cluster is a com-
mon and efficient practice. For example, three SDN controllers can load balance
and/or backup each other. If one or two fails, the other one can still make the
whole cluster survive, giving the operator longer maintenance windows to fix the
problem.
Tunneling protocols used in SDN networks have to provide at least the following
capabilities:
The ability to build connectivity for several different networks between two
SDN network nodes. This is called network segmentation.
The ability to transparently carry Ethernet frames and IP packets.
The ability to be carried over IP connectivity.
NVGRE
Geneve
The tunneling protocols shown in Figure 1.10 provide Overlay connectivity, which
is required between customer workloads connected to the SDN infrastructure.
NOTE There are also other protocols not included in this list. They are either not standardized or not actively used in the industry. One such example is STT, Stateless Transport Tunneling (https://www.ietf.org/archive/id/draft-davie-stt-08.txt).
TIP In VXLAN, specifically, each SDN node is called a VTEP (VXLAN tunnel endpoint) as it starts and terminates the overlay tunnels.
Southbound Interface
The southbound interface resides between the controller in the control layer and
network devices in the infrastructure layer and provides a means of communica-
tion between the two layers. Based on demands and needs, a SDN Controller will
dynamically change the configuration or routing information of network devices.
For example, a new VM will advertise a new subnet or host routes when it is
spawned in a server, and this advertisement will be delivered to an SDN controller
via a southbound protocol. Accordingly, the SDN controller collects all the rout-
ing updates from the whole SDN cluster through the southbound interfaces, de-
cides the most current and best route entries, and then it reflects this information
to all the other network devices or VMs. This ensures all devices have the most up-
to-date routing information in real time. Examples of the most well-known south-
bound interfaces in the industry are, among others, OpenFlow, OVSDB, gRPC
and XMPP. OpenFlow and OVSDB are perhaps the most well-known southbound
interfaces. We’ll briefly introduce them.
OpenFlow
OpenFlow is a protocol that sends flow information into the virtual switch so the
switch can forward the packets between the different ports. Flows are defined
based on different criteria such as traffic between a source MAC address and a
destination MAC address, source and destination IP addresses, TCP ports,
VLANs, tunnels, and so on.
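The flow idea above can be sketched with a toy, hypothetical flow table; real OpenFlow matching adds priorities, masks, and many more fields, but the subset-match-then-action shape is the same:

```python
def lookup(flow_table, packet):
    """Return the action of the first flow whose fields all match.

    A field absent from the match dict is a wildcard, mirroring how
    OpenFlow flows can match on any subset of header fields.
    """
    for match, action in flow_table:
        if all(packet.get(k) == v for k, v in match.items()):
            return action
    return "send-to-controller"   # table miss: punt, as OpenFlow switches do

# Hypothetical flows: (match criteria, action)
table = [
    ({"ip_dst": "10.0.0.5", "tcp_dst": 80}, "output:2"),
    ({"eth_dst": "ff:ff:ff:ff:ff:ff"}, "flood"),
]
print(lookup(table, {"ip_dst": "10.0.0.5", "tcp_dst": 80}))  # output:2
print(lookup(table, {"ip_dst": "10.0.0.9"}))                 # send-to-controller
```

The table-miss case is how the controller learns about new traffic: the switch punts unmatched packets, and the controller can then install a flow for them.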
OpenFlow is one of the most widely deployed southbound standards from the
open source community. Martin Casado at Stanford University first introduced it
in 2008. The appearance of OpenFlow was one of the main factors that gave birth
to SDN.
OpenFlow provides various kinds of information to the controller. It generates event-based messages when ports or links change, and it generates flow-based statistics for the forwarding network device and passes them to the controller. OpenFlow also provides a rich set of protocol specifications for effective communication on both the controller and switching element sides, and it provides an open source platform for the research community.
Every physical or virtual OpenFlow-enabled network (data plane) device in the
SDN domain needs to first register with the OpenFlow controller. The registration
process is completed via an OpenFlow HELLO packet originating from the Open-
Flow device to the SDN controller.
OVSDB
Unlike OpenFlow, Open vSwitch Database (OVSDB) is a southbound API designed to provide additional management and configuration capabilities for networking functions. With OVSDB you can create virtual switch instances, set
the interfaces and connect them to the switches. You can also provide the QoS pol-
icy for the interfaces. OVSDB sends and receives commands via JSON (JavaScript
Object Notation) RPCs.
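As a hedged illustration of those JSON RPCs, here is a sketch of an OVSDB "transact" request (method and parameter layout per RFC 7047); the bridge name is hypothetical, and a real transaction would also link the new bridge into the root Open_vSwitch row, which is omitted for brevity:

```python
import json

def ovsdb_insert_bridge(name: str, rpc_id: int = 0) -> str:
    """Sketch an OVSDB JSON-RPC 'transact' request (RFC 7047) that
    inserts a row into the Bridge table of the Open_vSwitch database."""
    request = {
        "method": "transact",
        "params": [
            "Open_vSwitch",                       # database name
            {"op": "insert", "table": "Bridge", "row": {"name": name}},
        ],
        "id": rpc_id,                             # matches reply to request
    }
    return json.dumps(request)

msg = json.loads(ovsdb_insert_bridge("br-int"))
print(msg["method"], msg["params"][1]["row"]["name"])  # transact br-int
```

In practice this string would be written to the ovsdb-server socket, and the reply (carrying the same `id`) would contain the UUID of the inserted row.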
Northbound Interface
The northbound interface provides connectivity between the controller and the
network applications running in the management plane. As already discussed, the
southbound interface has several available protocols, while the northbound interface lacks such protocol standards. With the advancement of technology, however, we now have a wide range of northbound API support, such as ad-hoc
APIs, RESTful APIs, etc. The selection of a northbound interface usually depends
on the programming language used in application development.
NFV stands for Network Function Virtualization, an operational framework for orchestrating and automating VNFs. VNF stands for virtual network function, such
as virtualized routers, firewalls, load balancers, traffic optimizers, IDS or IPS, web
application protectors, and so on.
In a nutshell, you can think of NFV as a concept or framework to virtualize certain
network functions, while VNF is the implementation of each individual network
function. Firewalls and load balancers are the two most common VNFs in the in-
dustry, especially for deployments inside data centers. When you read today’s docu-
ments about virtualization technology, you will see the terms in a pattern like
vXXX (e.g., vSRX, vMX), or cXXX (e.g., cSRX), where the letter “v” indicates it is
a virtualized product, while the letter “c” means containerized or its container
version.
OpenStack
Jointly launched by NASA and Rackspace in 2010, OpenStack has rapidly gained
popularity in many enterprise data centers. It is one of the most used open source
cloud computing platforms to support software development and big data analytics. OpenStack comprises a set of software modules, such as compute, storage, and networking modules, which work together to provide an open source choice for building cloud infrastructure.
If you compare OpenStack with SDN, it’s easy to see that the two models share
some common features. Both provide a certain level of abstraction, hiding the low-
level hardware details and exposing upper level user applications. The differences
are somewhat subtle to describe in just a few words. First off, although there are
various distributions from different vendors, they share common core components
managed by the OpenStack Foundation. SDN is more of a framework or an ap-
proach to manage the network dynamically, which can be implemented with to-
tally different software techniques.
Second, from the perspective of technical ecological coverage, the ecological as-
pects of OpenStack are much wider because networking, along with various other
plugins, is just one of the services implemented by its Neutron component. In contrast, SDN and its ecology focus mainly on networking. There are also
differences in the way that Neutron works compared to how a typical SDN con-
troller works. OpenStack Neutron focuses on providing network services for vir-
tual machines, containers, physical servers, etc. and provides a unified northbound
REST API to users. SDN focuses on configuration and management of forwarding
control toward the underlying network devices. It not only provides a user-oriented
northbound API, it also provides standard southbound APIs for communicating
with various hardware devices.
The comparison between OpenStack and SDN here is conceptual. In reality these two models can be, and in fact often are, coupled with each other in some way, loosely or tightly. One example is Tungsten Fabric (TF), which we’ll talk about later in this chapter.
Controllers
As mentioned previously, SDN is a networking solution that changes the tradi-
tional network architecture by bringing all control functionalities to a single loca-
tion and making centralized decisions. In this solution, SDN controllers are the
brain, which performs the control decision tasks while routing the packets. Cen-
tralized decision capability for routing enhances the network performance. As a
result, an SDN controller is the core component of any SDN solution.
While working with SDN architecture, one of the major points of concern is which
controller and solution should be selected for deployment. There are quite a few
SDN controller and solution implementations from various vendors, and every
solution has its own pros and cons, along with its working domain. In this section
we’ll review some of the popular SDN controllers in the market, and the corre-
sponding SDN solutions.
OpenDaylight (ODL)
OpenDaylight, often abbreviated as ODL, is a Java-based open source project
started in 2013. It was originally led by IBM and Cisco but later hosted under the
Linux Foundation. It was the first open source controller that could support non-
OpenFlow southbound protocols, which made it much easier to integrate with
multiple vendors.
ODL is not a single piece of software; it is a modular SDN platform that integrates multiple plugins and modules under one umbrella. Many plugins and modules have been built for OpenDaylight. Some are in production, while others are still under development.
Some of the initial SDN controllers had their southbound APIs tightly bound to
OpenFlow, but as you can see from Figure 1.13, besides OpenFlow, many other
southbound protocols available in today’s market are also supported. Examples
are NETCONF, OVSDB, SNMP, BGP, and more. Support for these protocols is
done in a modular method in the form of different plugins, which are linked dy-
namically to a central component named the Service Abstraction Layer (SAL). SAL
does translations between the SDN application and the underlying network
equipment. For instance, when it receives a service request from an SDN applica-
tion, typically via high level API calls (northbound), it understands the API call
and translates the request to a language that the underlying network equipment
can also understand. That language is one of the southbound protocols.
While this translation is transparent to the SDN application, ODL itself needs to
know all the details about how to talk to each one of the network devices it sup-
ports, their features, capabilities, etc. A topology manager module in ODL man-
ages this type of information. It collects topology related information from various
modules and protocols, such as ARP, host tracker, device manager, switch man-
ager, OpenFlow, etc., and based on this information, it visualizes the network to-
pology by dynamically drawing a diagram showing all the managed devices and
how they are connected together (see Figures 1.14 and 1.15).
Any topology changes, such as adding new devices, will be updated in the data-
base and reflected immediately in the diagram.
As an SDN controller, ODL has a global view of the whole network, therefore it
has all the necessary visibility and knowledge of the network that can be used to
draw the network diagram in real time.
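A minimal sketch of what such a topology manager does, with hypothetical device names: fold link events collected from modules like OpenFlow discovery or the host tracker into an adjacency map that the controller can redraw on every change:

```python
from collections import defaultdict

# Hypothetical link events, as a topology manager might collect them
# from discovery modules (OpenFlow LLDP probes, host tracker, etc.).
link_events = [
    ("switch1", "switch2"),
    ("switch2", "switch3"),
    ("switch1", "host-a"),
]

def build_topology(events):
    """Fold link events into a bidirectional adjacency map -- the data
    the controller needs to draw the network diagram in real time."""
    adjacency = defaultdict(set)
    for a, b in events:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

topo = build_topology(link_events)
# A new link event simply folds in, and the 'diagram' updates at once.
topo = build_topology(link_events + [("switch3", "host-b")])
print(sorted(topo["switch3"]))   # ['host-b', 'switch2']
```

A real topology manager also ages out stale links and tracks per-port metadata, but the core data structure is this adjacency view of the network.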
BFD
NetFlow/sFlow
port mirroring
LACP
VXLAN
STP
IPv6
Besides the functions of traditional switches, the bigger advantage of OVS is that it also has native support for SDN via the OVSDB and OpenFlow protocols.
That means any SDN controller can integrate OVS via these two open standard
protocols. Therefore OVS can work either as a standalone L2 switch within a hy-
pervisor host, or it can be managed and programmed via an SDN controller, such
as ODL. That is why it is used in so many open source and commercial virtualiza-
tion projects.
Calico
Here is a quote from the official Calico website:
Calico is an open source networking and network security solution for containers,
virtual machines, and native host-based workloads. Calico supports a broad range
of platforms including Kubernetes, OpenShift, Docker EE, OpenStack, and bare
metal services.
Calico has been an open-source project from day one. It was originally designed
for today’s modern cloud-native world and runs on both public and private
clouds. Its reputation derives from its deployment in Kubernetes and its ecosystem environments. Today Calico has become one of the most used Kubernetes Container Network Interfaces (CNI) and many enterprises are using it at scale.
Compared with other overlay network SDN solutions, Calico is special in the sense
that it does not use any overlay networking design or tunneling protocols, nor does
it require NAT. Instead it uses a plain IP networking fabric to enable host-to-host
and pod-to-pod networking. The basic idea is to provide Layer 3 networking capa-
bilities and associate a virtual router with each node, so that each node behaves like
a traditional router or a virtual router. We know that a typical internet router relies
on routing protocols like OSPF or BGP to learn and advertise the routing information, and that is the way a node in Calico networking works. Calico chooses BGP as its routing protocol because it is simple, is the industry’s current best practice, and is the only routing protocol that scales sufficiently.
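A toy model of the Calico behavior described above, with hypothetical node names and addresses: each node advertises the pod CIDR assigned to it over BGP, so every peer ends up with one plain IP route per remote node and no tunnels at all:

```python
# Hypothetical node inventory: in Calico, each node's BGP speaker
# advertises the pod CIDR block assigned to that node.
nodes = {
    "node1": {"ip": "192.168.1.11", "pod_cidr": "10.244.1.0/24"},
    "node2": {"ip": "192.168.1.12", "pod_cidr": "10.244.2.0/24"},
}

def routes_learned_by(name, nodes):
    """Routes a node learns from its BGP peers: every other node's pod
    CIDR, reachable via that node's IP -- plain IP routing, no overlay."""
    return {
        info["pod_cidr"]: info["ip"]
        for peer, info in nodes.items()
        if peer != name
    }

print(routes_learned_by("node1", nodes))   # {'10.244.2.0/24': '192.168.1.12'}
```

This is why Calico needs neither encapsulation nor NAT on a flat network: pod-to-pod traffic is forwarded by the ordinary kernel routing table populated from these advertisements.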
Calico uses a policy engine to deliver high-level network policy management.
VCP (Nuage)
The SDN platform offered by Nuage Networks (now Nokia) is the Virtualized
Cloud Platform (VCP). It provides a policy-based SDN platform that has a data
plane built on top of the open source OVS, and a closed source SDN controller.
The Nuage platform uses overlays to provide policy-based networking between different cloud environments (Kubernetes pods or non-Kubernetes environments
such as VMs and bare metal servers). It also has a real-time analytics engine to
monitor Kubernetes applications.
All components can be installed in containers. There are no special hardware
requirements.
Commercial Version
Juniper also maintains a commercial version of Contrail, and provides commercial
support to licensed users. Both the open source and commercial versions of Contrail provide the same full functionality, features, and performance.
TF Architecture
TF consists of two main components:
Tungsten Fabric Controller: This is the SDN controller in the SDN architec-
ture.
Tungsten Fabric vRouter: This is the forwarding plane that runs in each
compute node performing packet forwarding and enforcing network and
security policies.
The communication between the controller and vRouters is via XMPP, a widely
used messaging protocol.
TF Controller Components
In each TF SDN Controller there are three main components, as shown in Figure
1.17:
Configuration nodes - These nodes keep a persistent copy of the intended
configuration states and store them in a Cassandra database. They are also
responsible for translating the high-level data model into a lower-level form
suitable for interacting with control nodes.
Control nodes - These nodes are responsible for propagating the low-level state data they receive from the configuration nodes to the network devices and peer systems in an eventually consistent way. They implement a logically central-
ized control plane that is responsible for maintaining the network state.
Control nodes run XMPP with network devices, and run BGP with each other.
Analytics nodes - These nodes are mostly about statistics and logging. They are responsible for capturing real-time data from network elements, abstracting it, and presenting it in a form suitable for applications to consume. They collect, store, correlate, and analyze information from network elements.
TF vRouter Components
The TF vRouter is the main forwarding module running in each compute node.
The compute node is a general-purpose x86 server that hosts tenant VMs running
customer applications.
The TF vRouter consists of two components:
The vRouter agent, which is the local control plane
The vRouter forwarding plane, which performs the actual packet forwarding
In a typical configuration, Linux is the host OS and KVM is the hypervisor. The
Contrail vRouter forwarding plane can sit either in the Linux kernel space or in
the user space when running in DPDK mode. More details about this will be cov-
ered in later chapters in this book.
The vRouter agent is a user space process running inside Linux. It acts as the local, lightweight control plane in the compute node, much as a routing engine does in a physical router (see Figure 1.18). For example, vRouter agents establish XMPP neighborships with two controller nodes, then exchange routing information with them. The vRouter agent also dynamically generates flow entries and
injects them into the vRouter forwarding plane. This gives instructions to the
vRouter about how to forward packets.
The vRouter forwarding plane works like a line card of a traditional router (see
Figure 1.19). It looks up its local FIB and determines the next hop of a packet. It
also encapsulates packets properly before sending them to the underlay network, and decapsulates packets received from the underlay network.
We’ll cover more details of TF vRouter in later chapters.
SDN References
This has been a whirlwind tour of SDN so here are some additional references you
may find useful:
https://www.cs.princeton.edu/courses/archive/fall13/cos597E/papers/sdnhistory.pdf
https://www.opennetworking.org/sdn-definition/
https://www.openvswitch.org/
Chapter 2
Virtualization Concepts
Server Virtualization
Kernel-based virtual machine (KVM) is an open source virtualization technology
built into Linux. It provides hardware assistance to the virtualization software,
using built-in CPU virtualization technology to reduce virtualization overheads
(cache, I/O, memory) and improve security.
QEMU is a hosted virtual machine emulator that provides a set of different hard-
ware and device models for the guest machine. For the host, QEMU appears as a
regular process with its own process memory scheduled by the standard Linux
scheduler. In the process, QEMU allocates a memory region that the guest sees as
physical and executes the virtual machine’s CPU instructions.
With KVM, QEMU can create a virtual machine that the processor is aware of and that runs native-speed instructions on its virtual CPUs (vCPUs). When a special instruction is reached, such as one that interacts with devices or special memory regions, KVM pauses the vCPU and informs QEMU of the cause of the pause, allowing the hypervisor to react to that event.
Libvirt is an open-source toolkit that allows you to manage virtual machines and
other virtualization functionality, such as storage and network interface management (see Figure 2.1). Virtual components are defined in XML-formatted configurations that libvirt translates into the QEMU command line.
Interprocess Communication
Interprocess communication (IPC) is a mechanism that allows processes to communicate with each other and synchronize their actions. This communication is a form of cooperation between the processes involved.
IPC is used in network virtualization to exchange data between distributed processes of the same application (for example, the Virtio frontend and backend, or the Contrail vRouter agent and data plane) or between processes of distinct applications (e.g., Contrail vRouter and QEMU Virtio, Virtio and VFIO, and so on).
Two different modes of communication are used for IPC:
Shared Memory: processes read and write information into a shared memory
region.
Message Passing: processes establish a communication link that will be used
to exchange messages.
Shared Memory
The following sequence is used when shared memory is used for IPC:
First, a shared memory segment is created (shmget) with a key identifier known by the processes involved in the communication.
Second, processes attach (shmat) to the shared memory and retrieve a memory pointer.
Then, processes read or write information in the shared memory using the shared memory pointer (read/write operations).
Finally, processes detach (shmdt) from the attached shared memory segment.
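The steps above correspond to the System V shmget/shmat/shmdt C API. As a runnable sketch of the same create/attach/read-write/detach pattern, here is a Python equivalent using the standard multiprocessing.shared_memory module (the segment name plays the role of the key):

```python
from multiprocessing import shared_memory

# Create a segment (the shmget step); its name plays the role of the key.
creator = shared_memory.SharedMemory(create=True, size=16)

# A second handle attaches to the same segment by name (the shmat step).
attached = shared_memory.SharedMemory(name=creator.name)
attached.buf[:5] = b"hello"          # write through one mapping

result = bytes(creator.buf[:5])      # read through the other mapping
print(result.decode())               # -> hello

attached.close()                     # detach (the shmdt step)
creator.close()
creator.unlink()                     # remove the segment
```

Both handles see the same bytes because they map the same physical memory, which is the whole point of this IPC mode.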
Message Passing
Several message passing methods are available to exchange data information be-
tween processes:
eventfd: a system call that creates an eventfd object (a 64-bit counter). It can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events.
pipe (and named pipe): a unidirectional data channel. Data written to the write end of the pipe is buffered by the operating system until it is read from the read end of the pipe.
Unix Domain Socket: domain sockets use the file system as their address space. Processes reference a domain socket as an inode, and multiple processes can communicate using the same socket. The server side of the communication binds a UNIX socket to a path in the file system, so a client can connect to it using that path.
There are some other mechanisms that can be used by processes to exchange mes-
sages (shared file, message queues, network sockets, and signals system calls) but
they are not described in this book.
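As a minimal illustration of the pipe mechanism described above, the following Python sketch writes to the write end and reads the kernel-buffered data back from the read end:

```python
import os

# Create the unidirectional channel: r is the read end, w is the write end.
r, w = os.pipe()

os.write(w, b"ping")        # data is buffered by the kernel...
msg = os.read(r, 4)         # ...until it is read from the read end
print(msg.decode())         # -> ping

os.close(r)
os.close(w)
```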
Each flow uses a well-defined path: a control path and a data path.
Polling-based packet processing is an alternative method (it is the one used by DPDK). All incoming packets are copied transparently (without generating any interrupt) by the NIC into a specific host memory region predefined by the application. At regular intervals, the network application reads (polls) the packets stored in this memory area.
In the opposite direction, the network application writes packets into the shared memory region, and a DMA transfer is triggered to copy each packet from host memory to the NIC card buffers.
No interrupt is used with this method, but it requires the network application to check at regular intervals whether a new packet has hit the NIC. This method is well suited to high-rate packet processing; if packets arrive at a slow rate, this algorithm is less efficient than the event-based method.
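The polling method described above can be sketched with a toy receive loop; the ring, burst size, and packet contents below are illustrative stand-ins, not DPDK APIs:

```python
from collections import deque

# Stand-in for the RX ring the NIC fills by DMA in host memory.
rx_ring = deque()

def poll_burst(ring, max_burst=32):
    """Drain up to max_burst packets from the ring, without any interrupt."""
    burst = []
    while ring and len(burst) < max_burst:
        burst.append(ring.popleft())
    return burst

# The NIC side: three packets land in the ring.
rx_ring.extend(b"pkt%d" % i for i in range(3))

# The application side: one polling iteration picks them all up.
received = poll_burst(rx_ring)
print(len(received))        # -> 3
```

A real application would run poll_burst in an endless loop on a dedicated CPU, which is why this model burns cycles when traffic is light.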
Device emulation can be software-based or hardware-assisted. Software-based emulation is widely supported but can suffer from poor performance. Hardware-assisted emulation provides good performance thanks to its hardware acceleration, but requires hardware that supports specific features.
Software-based Emulation
Two solutions are proposed for device virtualization with software:
Traditional Device Emulation (binary translation): the guest device drivers are not aware of the virtualization environment. During runtime, the Virtual Machine Manager (VMM), usually QEMU/KVM, traps all the I/O and memory-mapped I/O (MMIO) accesses and emulates the device behavior (the trap-and-emulate mechanism). The VMM emulates the I/O device to ensure compatibility and then processes I/O operations before passing them on to the physical device (which may be different). Many VMEXITs (context switches) are generated with this method, so it provides poor performance.
Paravirtualized Device Emulation (virtio): the guest device drivers are aware
of the virtualization environment. This solution uses a front-end driver in the
guest that works in concert with a back-end driver in the VMM. These drivers
are optimized for sharing and have the benefit of not needing to emulate an
entire device. The back-end driver communicates with the physical device.
Performance is much better than with traditional device emulation.
Hardware-assisted Emulation
Two solutions are proposed for hardware-assisted device virtualization:
Direct assignment: a physical device is directly assigned to a single virtual machine, allowing the VM to directly access the network device. The guest device drivers can directly access the device configuration space and launch DMA operations in a safe manner, via the IOMMU, for example.
The drawbacks are:
Direct assignment has limited scalability. A physical device can only be assigned
to a single VM.
IOMMU must be supported by the host CPU (Intel VT-d or AMD-Vi feature).
SR-IOV: with SR-IOV (Single Root I/O Virtualization), each physical device (physical function) can appear as multiple virtual ones (virtual functions). Each virtual function can be directly assigned to one VM, and this direct assignment uses the VT-d/IOMMU feature.
The drawbacks are:
IOMMU must be supported by the host CPU (Intel VT-d or AMD-Vi feature).
SR-IOV must be supported by the NIC device (but also by the BIOS, the host
OS and the guest VM).
The control plane is used for capability exchange negotiation between the host and
guest, both for establishing and terminating the data plane. The data plane is used
for transferring the actual packets between the host and guest.
Virtqueues are the mechanism for bulk data transport on virtio devices. They are
composed of:
guest-allocated buffers that the host interacts with (read/write packets)
descriptor rings
Tap devices are virtual point-to-point network devices that user applications can use to exchange L2 packets. Tap devices require the tun kernel module to be loaded; it creates a device node in the /dev/net directory tree (/dev/net/tun). Each new tap device appears as a named network interface.
Vhost Protocol
The vhost protocol was designed to address the virtio device transport backend
limitations. It’s a message-based protocol that allows the hypervisor to offload the
data plane to a handler. The handler is a component that manages virtio data for-
warding. The host hypervisor no longer processes packets.
The data plane is fully offloaded to the handler that reads or writes packets to/
from the virtqueues. The vhost handler directly accesses the virtqueues memory
region in addition to sending and receiving notification messages.
The vhost handler is made up of two parts:
vhost-net: a kernel driver that uses the irqfd and ioeventfd file descriptors to exchange notifications with the guest.
vhost worker: a Linux thread named vhost-<pid> (where <pid> is the hypervisor process ID).
A tap device is still used to connect the guest instance to the host, but the virtio data plane is managed by the vhost handler and is no longer processed by the hypervisor (see Figure 2.7). Guest instances are no longer stopped (a context switch with a VMEXIT) at each virtio packet transfer. The new vhost-net packet processing backend is completely transparent to the guest, which still uses the standard virtio interface.
The VT-d or AMD IOMMU extensions must be enabled in BIOS in order to con-
duct device direct assignment. Two methods are supported:
PCI passthrough: PCI devices on the host system are directly attached to
virtual machines, providing guests with exclusive access to PCI devices for a
range of tasks. This enables PCI devices to appear and behave as if they were
physically attached to the guest virtual machine.
VFIO device assignment: VFIO improves on previous PCI device assignment
architecture by moving device assignment out of the KVM hypervisor and
enforcing device isolation at the kernel level.
With VFIO, the physical device is exposed to host user space memory and made visible to the guest VM it has been assigned to (see Figure 2.8).
SR-IOV
The Single Root I/O Virtualization (SR-IOV) specification is defined by the PCI-SIG (PCI Special Interest Group). It is a PCI Express (PCIe) extension that allows a single physical PCI function to share its PCI resources as separate virtual functions (VFs).
The physical function contains the SR-IOV capability structure and manages the
SR-IOV functionality (it can be used to configure and control a PCIe device).
A single physical port (root port) presents multiple, separate virtual devices as
unique PCI device functions (up to 256 virtual functions – depending on device
capabilities).
Each virtual device may have its own unique PCI configuration space, memory-
mapped registers, and individual MSI-based interrupts (MSI: Message Signalled
Interrupts). Unlike a physical function, a virtual function can only configure its
own behavior. Each virtual function can be directly connected to a VM via PCI
device assignment (passthrough mode).
SR-IOV improves network device performance for each virtual machine as it can
share a single physical device between several virtual machines using the device
direct I/O assignment method (Figure 2.9).
With SR-IOV, each VM has direct access to the physical network using the virtual function interface assigned to it. VMs can communicate with one another using the Virtual Ethernet Bridge provided by the NIC card. A virtual switch can also use SR-IOV to get access to the physical network. A VM using an SR-IOV-assigned virtual function device has direct access to the physical network and is not connected to any intermediate virtual network switch or router (see Figure 2.10).
The following command can be used to check whether SR-IOV is supported or not
on a physical NIC card:
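The command itself is not reproduced here. As an illustrative alternative check (assuming a Linux host with sysfs), the kernel exposes an sriov_totalvfs attribute for SR-IOV capable physical functions; the PCI address below is a placeholder:

```python
from pathlib import Path

def sriov_total_vfs(pci_addr):
    """Return the number of VFs a PF supports, or None if SR-IOV is absent."""
    attr = Path("/sys/bus/pci/devices") / pci_addr / "sriov_totalvfs"
    return int(attr.read_text()) if attr.is_file() else None

# Placeholder PCI address; on most machines this prints None.
print(sriov_total_vfs("0000:00:00.0"))
```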
Also, by providing direct access to the physical NIC, SR-IOV makes the host virtual network nodes (virtual router/switch) used by the SDN solution completely blind to the VMs using such connectivity. Local traffic switching between VMs connected to the same SR-IOV physical card is achieved by the virtual Ethernet bridge provided by SR-IOV. Communication between VMs connected to distinct SR-IOV physical ports must rely on the physical network.
SDN vSwitch/vRouter usage is very limited with SR-IOV. Indeed, packet switching between VMs that use VFs from the same SR-IOV physical port is performed by the virtual Ethernet bridge hosted in the physical NIC (see Figure 2.11).
Only a few use cases are relevant. The first provides internal connectivity between VMs using distinct SR-IOV physical ports connected to a virtual switch/router (it avoids sending the traffic out of the server to be processed by the physical network).
Figure 2.11 Connectivity Across Distinct Physical SRIOV Ports Using a Dual Homed Virtual
Switch/Router
A second use case is building hybrid-mode solutions with multi-NIC VMs. Network traffic not requiring high performance (management traffic, for instance) uses an emulated NIC, while network connectivity requiring high performance (video data traffic, for instance) is processed by the SR-IOV assigned NIC. See Figure 2.12.
With SR-IOV you get high performance, but with poor flexibility and no network virtualization features. With virtio you get a high level of network virtualization suitable for SDN and great flexibility, but poor performance.
For SDN use cases, you need network virtualization features and performance.
DPDK will bring both.
Generic Linux Ethernet drivers are not powerful enough to process such a 10Gb/s packet flow. Indeed, with regular Linux NIC drivers, a lot of time is spent performing packet processing in the Linux kernel using the interrupt mechanism, and transferring application data from host memory to the NIC.
DPDK is one of the best solutions available as it allows you to build a network ap-
plication using high-speed NICs and work at wire speed. Therefore, Contrail is
proposing DPDK as one of the solutions to be used for the physical compute
connectivity.
Using the DPDK library API, physical NIC packets are made available in user
space memory in which the DPDK application is running. So, when DPDK is used,
there is no user space to kernel space context switching, which saves a lot of CPU
cycles. The host memory also uses a large contiguous memory area: huge pages allow large data transfers and avoid heavy data fragmentation in memory, which would otherwise require greater memory management effort at the application level. Such fragmentation would also cost precious CPU cycles.
Hence, most of the CPU cycles of the DPDK pinned CPU are used for polling and
processing packets delivered by the physical NIC in DPDK queues. As a result, the
packet forwarding task can be processed at a very high speed. If one CPU is not
powerful enough to manage incoming packets that are hitting the physical NIC at
a very high rate, you can allocate an additional one to the DPDK application in
order to increase its packet processing capacity.
A DPDK application is a multi-thread program that uses the DPDK library to pro-
cess network data. In order to scale, you can start several packet polling and pro-
cessing threads (each one pinned on a dedicated CPU) that are running in parallel.
Three main components are involved in a DPDK application (see Figure 2.15):
Physical NIC
DPDK Overview
DPDK is a set of data plane libraries and network interface controller drivers for
fast packet processing, currently managed as an open-source project under the
Linux Foundation. The main goal of the DPDK is to provide a simple, complete
framework for fast packet processing in data plane applications.
The framework creates a set of libraries for specific environments through the cre-
ation of an Environment Abstraction Layer (EAL), which may be specific to a
mode of the Intel architecture (32-bit or 64-bit), Linux user space compilers, or a
specific platform. These environments are created through the use of make files
and configuration files. Once the EAL library is created, the user may link with the
library to create their own applications.
The DPDK implements a “run to completion model” for packet processing, where
all resources must be allocated prior to calling data plane applications, running as
execution units on logical processing cores.
The model does not support a scheduler and all devices are accessed by polling.
The primary reason for not using interrupts is the performance overhead imposed
by interrupt processing.
For more information please refer to the dpdk.org documentation: http://dpdk.org/doc/guides/prog_guide/index.html
The DPDK framework provides, among other components:
A memory manager that allocates pools of objects in memory and uses a ring to store free objects.
Poll mode drivers (PMDs) that are designed to work without asynchronous notifications, reducing overhead.
A packet framework made up of a set of libraries that help develop packet processing applications.
In order to reduce the Linux user to kernel space context switching, all of these
functions are made available by DPDK into the user space where applications are
running. User applications using DPDK libraries have direct access to the NIC
cards, without passing through a NIC Kernel driver as is required when DPDK is
not used.
DPDK Libraries
DPDK allows you to build user-space multi-thread network applications using the
POSIX thread (pthread) library. DPDK is a framework made of several libraries:
Environment Abstraction Layer (EAL)
The ethdev library exposes APIs to use the networking functions of DPDK NIC
devices. The bottom half of ethdev is implemented by NIC PMD drivers. Thus,
some features may not be implemented.
Poll Mode Ethernet Drivers (PMDs) are a key component for DPDK. These PMDs
bypass the kernel and provide direct access to the Network Interface Cards (NIC)
used with DPDK.
Linux user space device enablers (UIO or VFIO) are provided by the Linux kernel and are required to run DPDK. They allow PCI device information and address space to be discovered and exposed through the /sys directory tree.
DPDK libraries (See Figure 2.17) allow kernel-bypass application development:
probing for PCI devices (attached via a Linux user space device enabler),
huge-page memory allocation
Only a few libraries are described in this diagram. The set of libraries is enriched with each new DPDK release (see https://www.dpdk.org/).
Atomic/lock operations
Time reference
Interrupt handling
Alarm operations
The mbufs store the DPDK NIC incoming and outgoing packets that have to be processed by the DPDK application.
Packet Descriptors
DPDK queues do not store the packets themselves: a descriptor points to the real packet, as seen in Figure 2.20. This avoids the data transfer that would otherwise be needed when packets are forwarded from one DPDK NIC to another. Packets are not moved from one queue to another; it is the descriptors (pointers) that move from one queue to another, as seen in Figure 2.21.
DPDK Rings
Descriptors are set up as a ring: a circular array of descriptors. Each ring describes a single-direction DPDK NIC queue, and each DPDK NIC queue is made up of two rings (one per direction: one RX ring, one TX ring). See Figure 2.22. Each descriptor points to a packet that has been received (RX ring) or that is going to be transmitted (TX ring). The more descriptors the RX/TX rings contain, the more memory is required to store data in each mempool (number of mbufs).
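The descriptor ring described above can be sketched as a toy circular array; this is an illustrative model, not the DPDK rte_ring implementation:

```python
class Ring:
    """Toy circular array of descriptors; one ring per queue direction."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0       # next slot to write (producer)
        self.tail = 0       # next slot to read (consumer)
        self.size = size

    def enqueue(self, desc):
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:        # one slot kept empty: ring is full
            return False
        self.slots[self.head] = desc
        self.head = nxt
        return True

    def dequeue(self):
        if self.tail == self.head:  # ring is empty
            return None
        desc = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.size
        return desc

rx = Ring(4)
rx.enqueue("desc->mbuf0")     # only the descriptor moves, never the packet
rx.enqueue("desc->mbuf1")
first = rx.dequeue()
print(first)                  # -> desc->mbuf0
```

Note that the values stored are stand-ins for pointers to mbufs: enqueuing and dequeuing never copies packet data, which is the property the text emphasizes.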
[Figure: RX queue. The device RX ring and mbuf mempool reside in huge pages memory; descriptors link mbufs to the NIC RX FIFO.]
DMA transfer transparently copies packets from physical NIC memory to the host central memory. DMA uses the RDT descriptor as a destination memory address for the data to be transferred. Once packets have been transferred into host memory, both the RX ring and the RDT register are updated.
[Figure: TX queue. The device TX ring and mbuf mempool reside in huge pages memory; descriptors link mbufs to the NIC TX FIFO.]
Synchronization between the host OS and the NIC happens through two registers,
whose content is interpreted as an index in the TX ring:
Transmit Descriptor Head (TDH): indicates the first descriptor that has been
prepared by the OS and has to be transmitted on the wire.
Transmit Descriptor Tail (TDT): indicates the position to stop transmission,
i.e. the first descriptor that is not ready to be transmitted, and that will be the
next to be prepared.
Linux pthreads
Multithreading is the ability of a CPU (a single core in a multi-core processor architecture) to provide multiple threads of execution concurrently. In a multithreaded application, the threads share some CPU resources and memory:
CPU caches
A single Linux process can contain multiple threads, all of which are executing the
same program. These threads share the same global memory (data and heap seg-
ments), but each thread has its own stack (local variables).
Linux pthreads (POSIX threads) is a C library containing a set of functions for managing threads within an application. DPDK uses the Linux pthreads library.
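A small sketch of the shared-globals/private-stack property described above, using Python threads in place of C pthreads:

```python
import threading

counter = 0                        # shared global (data segment)
lock = threading.Lock()

def work(n):
    """Each thread has its own stack (n and _ are locals) but shares counter."""
    global counter
    for _ in range(n):
        with lock:                 # serialize updates to the shared memory
            counter += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                     # -> 4000
```

Without the lock, the four threads would race on the shared counter, which is exactly why shared-memory threading needs synchronization.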
DPDK lcores
DPDK uses threads that are designated as "lcores." An lcore refers to an EAL thread, which is really a Linux pthread running on a single processor execution unit:
The first lcore, which executes the main() function and launches the other lcores, is named the master lcore.
Any lcore that is not the master lcore is a slave lcore.
Lcores do not share CPU units. Nevertheless, if the host processor supports hyper-threading, a core may host several lcores or threads. Lcores are used to run DPDK application packet processing threads. Several packet processing models are proposed by DPDK. The simplest one is the run-to-completion model, shown in Figure 2.25.
Run to completion uses a single thread (lcore) for end-to-end packet processing (packet polling, processing, and forwarding). For instance, the Contrail DPDK vRouter uses this model for GRE-encapsulated packet processing.
Control Threads
It is possible to create control threads, which can be used for management and infrastructure tasks and are used internally by DPDK for multi-process support and interrupt handling.
Service Core
DPDK service cores enable a dynamic way of performing work on DPDK lcores.
Service core support is built into the EAL and an API is provided to optionally al-
low applications to control how the service cores are used at runtime.
Linux packet processing with the sockets API requires the following operations, which can be costly:
Linux kernel system calls
With usual Linux drivers, most operations occur in kernel mode and require lots of user space to kernel space context switching and interrupt mechanisms. The heavy use of context switching costs lots of CPU cycles and limits the number of packets that a CPU is able to process. Such drivers are not able to perform packet processing at the expected high speeds, especially when 10/40/100G Ethernet generation cards are used on a Linux system.
use a Linux user space device enabler (UIO or VFIO) driver for specific control changes (interrupt configuration)
Hence, user applications can directly configure the NIC cards they use from the Linux user space where they run.
A first configuration phase uses the PMDs and the DPDK library to configure DPDK ring buffers in Linux user space. Next, incoming packets are automatically transferred with the DMA (direct memory access) mechanism from the physical RX queues in NIC memory to the DPDK RX rings buffered in host memory. DMA is also used to transfer outgoing packets from the DPDK TX ring buffers in host memory to the physical TX queues in NIC memory. DMA offloads expensive memory operations, such as large copies or scatter-gather operations, from the CPU.
IOMMU
Input/output memory management unit (IOMMU) is a memory management unit
(MMU) that connects a DMA capable I/O bus to the main memory. See Figure
2.29.
In virtualization, an IOMMU remaps the addresses accessed by the hardware through a translation table similar to the one used to map guest virtual machine memory addresses to host-physical memory addresses.
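The remapping described above can be modeled as a simple translation table; the addresses and table below are purely illustrative:

```python
PAGE = 0x1000                       # 4KB pages

# Toy translation table: device-visible IOVA page -> host-physical page.
iommu_table = {0x1000: 0x7F000, 0x2000: 0x88000}

def dma_translate(iova):
    """Remap an IOVA, refusing DMA outside the allotted boundaries."""
    page, offset = iova & ~(PAGE - 1), iova & (PAGE - 1)
    if page not in iommu_table:
        raise PermissionError("DMA blocked: address outside allotted region")
    return iommu_table[page] | offset

print(hex(dma_translate(0x1008)))   # -> 0x7f008
```

The PermissionError branch is the toy analog of the isolation property discussed next: a device can only reach the memory it has been granted.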
The IOMMU gives devices a path to access only a well-scoped physical memory area corresponding to a given guest virtual machine's memory. The IOMMU helps prevent DMA attacks that could originate from malicious devices: it provides DMA and interrupt remapping facilities to ensure I/O devices behave within the boundaries they have been allotted.
Intel has published a specification for IOMMU technology: Virtualization Tech-
nology for Directed I/O abbreviated as VT-d.
In order to get the IOMMU enabled:
both the kernel and BIOS must support and be configured to use I/O virtualization (such as VT-d).
the IOMMU must be enabled in the Linux kernel parameters in /etc/default/grub, and the update-grub command must be run.
GRUB configuration example with IOMMU Passthrough enabled:
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt intel_iommu=on"
Different PMDs may require different kernel drivers in order to work properly (cf
Linux user space device enablers). Depending on the PMD being used, a corre-
sponding kernel driver should be loaded and bound to the network ports.
It is also preferable that each NIC has been flashed with the latest version of NVM
(Non-Volatile Memory)/firmware.
UIO only supports legacy interrupts so it is not usable with SR-IOV and virtual
hosts that require MSI/MSI-X interrupts.
Despite these limitations, UIO is well suited for use in VMs, where direct IOMMU access is not available. In such a situation, a guest instance user space process is not isolated from other processes in the same instance, but the hypervisor can isolate any guest instance from other guests or from hypervisor host processes using the IOMMU.
Currently, two UIO modules are supported by DPDK:
Linux Generic (uio_pci_generic), which is the standard proposed UIO module
included in the Linux kernel.
DPDK specific (igb_uio), which must be compiled with the same kernel as the
one running on the target.
The DPDK-specific UIO kernel module is loaded with the insmod command after the UIO module has been loaded:
$ sudo modprobe uio
$ sudo insmod kmod/igb_uio.ko
The DPDK-specific UIO module may be preferred to the Linux generic UIO module in some situations (cf. https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html).
VFIO exposes APIs that allow:
an eventfd- and irqfd-based signaling mechanism to support events and interrupts from and to the user space application.
The IOMMU subsystem is used to place devices into IOMMU groups. User space processes can open these IOMMU groups and register memory with the IOMMU for DMA access using VFIO ioctl calls. VFIO also provides the ability to allocate and manage message-signaled interrupt vectors.
A single command is needed to load VFIO module:
$ sudo modprobe vfio_pci
Despite the fact that VFIO was created to work with the IOMMU, VFIO can also be used without it (though this is just as unsafe as using UIO).
Cache memory is shared among the cores of a single CPU. Typical characteristics of memory caches are:
Accessing a Level 1 cache takes 7 CPU cycles (with a size of 64KB or 128KB).
If the CPU needs to access data that is in the main RAM, it has to use its RAM
controller.
Access to RAM typically takes 170 CPU cycles (the green line in Figure 2.32). Access
to the remote RAM through the remote RAM controller typically adds 200 cycles
(the red line in Figure 2.32), meaning RAM latency is roughly doubled.
When data needed by the CPU is located in both the local and the remote RAM with
no particular structure, latency to access data can be unpredictable and unstable.
Hyper-threading (HT)
A single physical CPU core with hyper-threading appears as two logical CPUs to an
operating system. While the operating system sees two CPUs for each core, the ac-
tual CPU hardware has only a single set of execution resources for each core. Hyper-
threading allows the two logical CPU cores to share physical execution resources.
The sharing of resources allows two logical processors to work with each other
more efficiently and allows a logical processor to borrow resources from a stalled
logical core (assuming both logical cores are associated with the same physical core).
Hyper-threading can help speed up processing, but it’s nowhere near as good as hav-
ing actual additional cores.
Huge Pages
Memory is managed in blocks known as pages. On most systems a page is 4KB, so 1MB of memory is 256 pages and 1GB of memory is 262,144 pages (see Figure 2.34). CPUs have a built-in memory management unit that manages a list of these pages in hardware.
Virtual memory address lookup slows down when the number of entries increases.
A huge page is a memory page that is larger than 4KB. (See Figure 2.34.) In
x86_64 architecture, in addition to standard 4KB memory page size, two larger
page sizes are available: 2MB and 1GB. Contrail DPDK vRouter can use both or
only one huge page size.
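A quick arithmetic check of the page counts behind the huge page motivation above:

```python
GiB = 1024 ** 3

# Pages required to map 1GB of memory at each x86_64 page size.
for label, size in (("4KB", 4 * 1024), ("2MB", 2 * 1024 ** 2), ("1GB", GiB)):
    print(label, GiB // size)
# 4KB 262144
# 2MB 512
# 1GB 1
```

Fewer pages means fewer translation entries to manage, which is why huge pages relieve pressure on address lookups.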
Isolation and pinning are two complementary mechanisms that are proposed by
Linux OS:
CPU isolation restricts the set of CPUs that are available at the operating system scheduler level. When a CPU is isolated, no task will be scheduled on it by the OS; an explicit task assignment must be made.
CPU pinning, also called processor affinity, enables the binding and unbinding of a process or a thread to a CPU. It consists of defining a limited set of CPUs that are allowed to be used by:
The OS scheduler: OS CPU affinity is managed through the system.
A specific process: using CPU pinning rules (the taskset command, for instance).
Tasks to be run by an operating system must be spread across the available CPUs. In a multithreaded environment, these tasks are often made up of several processes, which are themselves made up of several threads.
Isolcpus
Isolcpus is a kernel scheduler option. When a CPU is specified in the isolcpus list, it is removed from the general kernel SMP balancing and scheduler algorithms. The only way to move a process onto or off an isolated CPU is via the CPU affinity syscalls (or the taskset command).
This isolation mechanism:
removes isolated CPUs from the common CPU list used to process all tasks.
does not allow the CPU isolation rules to be rearranged after system startup: the only way to change the isolated CPU list is to reboot with a different isolcpus value in the boot loader configuration (GRUB, for instance).
disables the scheduler load balancer for isolated CPUs. This also means the kernel will not balance tasks equally among the CPUs sharing the same isolation rules (having the same affinity mask).
75 Server Virtualization
CPU Shield
The cgroups subsystem provides a mechanism to dedicate some CPUs to one or
several user processes. It consists of defining a shielded user group, which
protects a subset of CPUs from system tasks.
Three cpusets are defined:
root: present in all configurations and containing all CPUs (unshielded)
system: containing the CPUs used for system tasks, the ones which need to run
but aren't important (unshielded)
user: containing the CPUs dedicated to the tasks to which you want to give
exclusive CPU use (shielded)
The CPU shield can be manipulated with the cset shield command.
Tuned
Tuned is a system tuning service for Linux. Tuned uses Tuned profiles to describe
Linux OS performance tuning configuration.
The cpu-partitioning profile partitions the system CPUs into isolated and house-
keeping CPUs. This profile is intended to be used for latency-sensitive workloads.
NOTE Currently, Tuned is only supported on the Red Hat Linux OS family. See:
https://tuned-project.org/.
A CPU affinity can also be defined per systemd service:
# vi /etc/systemd/system/<my service>.service
...
[Service]
CPUAffinity=<CPU mask>
If a specific CPU affinity has been defined for a given service, the service has
to be restarted for the new configuration to be taken into account.
Alternatively, the affinity of a running process can be set with:
# taskset -p mask pid
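The mask argument is simply a bitmap with one bit per CPU. A throwaway helper (illustrative only, not part of taskset) makes the encoding explicit:

```python
def cpus_to_mask(cpus):
    """Build a taskset-style affinity mask: bit N set => CPU N allowed."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# CPUs 1,2,3,4 => binary 11110 => mask 0x1e
print(hex(cpus_to_mask([1, 2, 3, 4])))  # 0x1e
```

So `taskset -p 0x1e <pid>` restricts the process to CPUs 1 through 4.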
You can see the kernel-mode VM connected to a DPDK compute application. The
user application uses the vhost-user library for the emulated PCI NIC control
plane. Support for user space vhost has been provided since QEMU 2.1.
Virtual IOMMU
Virtual IOMMU (vIOMMU) emulates an IOMMU for guest VMs. vIOMMU has the
following characteristics:
It translates guest VM I/O Virtual Addresses (IOVA) to Guest Physical
Addresses (GPA); the GPAs are in turn translated to Host Virtual Addresses
(HVA) through the hypervisor memory management system.
It performs device isolation.
The integration between the virtual IOMMU and any user space network applica-
tion like DPDK is usually done through the VFIO driver. This driver performs de-
vice isolation and automatically adds the memory (IOVA to GPA) mappings to the
virtual IOMMU.
The use of huge page memory in DPDK helps optimize TLB lookups, since fewer
memory pages cover the same amount of memory. Consequently, the number of
device TLB synchronization messages drops dramatically, and the performance
penalty from TLB lookups is lowered. See: https://www.redhat.com/en/blog/
journey-vhost-users-realm and https://wiki.qemu.org/Features/VT-d.
78 Chapter 2: Virtualization Concepts
Vhost user protocol moves the virtio ring from kernel all the way to user space.
The ring is shared between the guest and DPDK application. QEMU sets up this
ring as a control plane using UNIX sockets.
If both the host server and the guest VM run DPDK, there are no VMExits in the
host for guest packet processing. Guest virtual machines use the virtio-net PMD
driver, which polls for packets. Nothing runs in the kernel here, so there are
no system calls. Since both system calls and VMExits are avoided, the
performance boost is significant.
Figure 2.39 Physical Device Assigned to a Guest VM protected by both IOMMU and vIOMMU
Physical incoming packets are copied directly into the guest memory without
involving the host server. SR-IOV only allows sharing a physical NIC between
several guests, creating a Virtual Function dedicated to a single guest; it
does not change the packet processing path provided by the PCI passthrough
mechanism.
By leveraging the VFIO driver in the host kernel, we can provide direct access
to an assigned SR-IOV virtual function, with the guest memory protected by the
IOMMU (Figure 2.40).
offers good hardware performance, like that provided by SR-IOV and direct
physical device assignment
can be used in SDN, like the Virtio software solution.
Hence, once the guest memory is mapped to the NIC using Virtio physical device
passthrough, the guest communicates directly with the NIC via PCI without
involving any specific driver in the host kernel.
Guest VM packet processing is performed directly in NIC hardware but presented
to the guest instance like a regular Virtio-emulated interface. The guest VM
cannot tell the difference between a Virtio-emulated interface and an assigned
physical Virtio NIC, as both are exposed through the same Virtio frontend
driver in the guest. See Figure 2.41.
83 Run DPDK in a Guest VM
The Virtio driver in the guest is decoupled from any vendor implementation for
the control path.
vDPA presents a generic control plane through software, which provides an
abstraction layer on top of the physical NIC. Like Virtio full hardware
offloading, vDPA builds a direct data path between the guest network interface
and the physical NIC, using the Virtio ring layout. On the control path,
however, a generic vDPA (mediation) driver translates the vendor NIC
driver/control plane to the Virtio control plane, allowing each NIC vendor to
keep using its own driver.
vDPA thus allows NIC vendors to support the Virtio ring layout with less
effort while keeping wire-speed performance on the data plane. See Figure 2.43.
Smart NIC
A new NIC card generation, commonly named SmartNIC, is highly customizable
thanks to recently added capabilities (FPGA, ARM, P4). It is now possible to
move SDN vSwitch/vRouter data plane functions onto the NIC card, keeping only
the control plane function in the host operating system.
For Contrail solutions, this is achieved by offloading several Contrail vRouter
tables, including:
Interface Tables
Next Hop Tables
IPv4 FIB
IPv6 FIB
L2 Forwarding Tables
Flow Tables
This permits accelerating lookups and forwarding actions, which are performed
directly in the NIC, as shown in Figure 2.45. You can see in Figure 2.45 that
SDN packet processing is fully completed on the NIC card: no host CPU
processing is involved anymore.
Two implementations proposed by Netronome are SR-IOV + SmartNIC and
XVIO + SmartNIC, as shown in Figures 2.46 and 2.47.
XDP has been available in the Linux kernel since version 4.8, while eBPF has
been supported since version 3.18.
XDP requires:
Multi-queue NICs
BPF programs
The BPF program parses each packet and selects one of several actions:
Drop
Normal receive (regular Linux packet processing with socket buffer and TCP/
IP stack)
Generic Receive Offload (coalescing several received packets of the same
connection)
XDP can also offload an eBPF program to a NIC card that supports it, reducing
the CPU load. See Figure 2.49.
NOTE eBPF rules are also supported in DPDK applications. See: https://www.
redhat.com/en/blog/using-express-data-path-xdp-red-hat-enterprise-linux-8.
In the middle of this matrix chart, SmartNIC and DPDK offer the best compromise
for SDN usage. SmartNICs offer very high performance, but they are not yet a
fully mature solution (many implementations are vendor-specific, and there is
no agreed standard). Figure 2.51 lists the feature sets of the solutions
examined in this chapter.
93 NIC Virtualization Solutions Summary
(*): depends on hardware and QEMU latest virtio specification support on the NIC card.
Contrail vRouter
You can see there is an orchestrator at the top of Figure 3.1 that can be OpenStack
or Kubernetes. Below that, there are controller components like control node, con-
fig node, and analytics node. At the bottom are the compute nodes. Compute
nodes are general purpose x86 servers and they are the main focus of this chapter.
Figure 3.2 shows a more detailed view of the compute node. This is the place
where vRouter runs. It is the most important component of the Contrail data
plane. You can see some workloads running, and they can be either VMs or con-
tainers. These workloads have their interfaces plumbed into the vRouter.
At a high level, vRouter forms dynamic overlay tunnels with other workloads
running on the same or different computes to send and receive data traffic.
Within the server, it switches packets between the VM interfaces and the
physical interfaces after doing the required encapsulation or decapsulation.
Currently, the encapsulation protocols supported by vRouter are MPLS over UDP
(MPLSoUDP), MPLS over GRE (MPLSoGRE), and VXLAN. Each of these workloads has a
corresponding forwarding state, or routing instance, inside vRouter, which it
uses to switch packets. The physical interface connected to the top-of-rack
(TOR) switch can be in single or bonded mode.
The vRouter itself can run either as a Linux kernel module or as a user space
DPDK process. There is also a vRouter agent process running in user space. The
agent has a connection to the controller over an XMPP channel, which is used
to download configuration and forwarding information. The main job of the
agent is to program this forwarding state into the vRouter forwarding plane.
96 Chapter 3: Contrail DPDK vRouter Architecture
vRouter Architecture
The vRouter is the workhorse of the Contrail system. Each and every packet to
and from the Contrail cluster goes through the vRouter. The vRouter is high
performance, efficient, and capable of processing millions of packets per
second. It is multi-threaded, multi-cored, and multi-queued to achieve maximum
parallelism and exploit the x86 hardware to the maximum extent.
To support its rich and diverse features, vRouter has a sophisticated packet
processing pipeline. The same pipeline can be stitched by the vRouter agent
process in the simplest to the most complicated manner, depending on the
treatment that needs to be given to a packet. vRouter maintains multiple
instances of forwarding information bases. All the table accesses and updates
use RCU (Read-Copy-Update) locks.
vRouter Interfaces
Figure 3.3 depicts the vRouter and its interfaces. Each of these vRouter interfaces is
called a vif or vRouter interface. There are interfaces to each of the workloads
(VM1, VM2, VMn) that it manages. These are typically tap interfaces.
To send packets to other physical servers or switches, vRouter uses the
physical interfaces, which can be single or bonded NICs. vRouter is only
interested in overlay packets, that is, packets to and from the workloads.
Other packets are sent to the host OS through a Linux interface.
This Linux interface is called vhost0. vRouter also has interfaces toward the
vRouter agent: a Netlink interface to download the forwarding state, and an
interface to send and receive exception packets. The latter is called the
pkt0 interface.
This is sample output from the vif --list command, which provides the list of all
vifs that are configured on a compute node:
97 Contrail Software Stack
vif0/2 Socket: unix
Type:Agent HWaddr:00:00:5e:00:01:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3Er QOS:-1 Ref:3
RX port packets:135922 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:135922 bytes:11689292 errors:0
TX packets:36432 bytes:3198966 errors:0
Drops:0
There are various tables and engines in action in this pipeline. Some of the impor-
tant tables are flow table, route table, NH table, and the MPLS/VXLAN table. The
vRouter agent programs these tables based on the forwarding state it receives from
the control node and also based on its own internal processing. Each packet, de-
pending on which interface it is coming from, is subjected to the desired processing.
At a high level, all packets enter from a vif interface. The vifs are nothing but one of
the vRouter interfaces described previously, for example: tap interface, physical
interface, vhost0 interface, agent interface, etc. Depending upon the configuration
of that interface, packets enter different pipeline stages, doing lookups in different
tables based on what actions are defined in each stage, and the packets are modified
accordingly.
At the end of the processing, the packet is sent to another vRouter interface,
or vif, after encapsulation or decapsulation. This is a fairly generic
pipeline, and the agent stitches it based on the rich feature set that the
Contrail cluster can configure.
Another important aspect of vRouter is that of forwarding modes. The vRouter
can work in two modes - flow mode (bottom pipeline in Figure 3.4) or packet mode
(top of Figure 3.4). By default, Contrail works in flow mode. This means that
vRouter keeps track of every single flow traversing it. Depending on the flow ac-
tion, it can either forward the packet or drop it. In the packet mode, the vRouter
bypasses the flow table and directly uses the next hop for treatment that needs to
be given to the packet. For example, if the next hop is a tunnel next hop, the pack-
et is encapsulated in a tunnel header and forwarded to an outgoing interface.
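The difference between the two forwarding modes can be caricatured as two lookup paths. The table layouts and names below are hypothetical, purely to make the control flow concrete; this is not vRouter code:

```python
# Hypothetical sketch of the two vRouter forwarding modes.
# Flow mode: a 5-tuple lookup in the flow table decides the action;
# packet mode: the flow table is bypassed and the next hop is used directly.

flow_table = {  # (src, dst, proto, sport, dport) -> (action, next hop)
    ("10.0.0.1", "10.0.0.2", 17, 5000, 5001): ("FORWARD", "nh-tunnel-1"),
}
route_table = {"10.0.0.2": "nh-tunnel-1"}  # destination -> next hop

def forward(pkt, flow_mode=True):
    """Return the next hop to use, or None if the packet is dropped."""
    if flow_mode:
        action, nh = flow_table.get(pkt["5tuple"], ("DROP", None))
        return nh if action == "FORWARD" else None
    return route_table.get(pkt["dst"])  # packet mode: next hop lookup only

pkt = {"5tuple": ("10.0.0.1", "10.0.0.2", 17, 5000, 5001), "dst": "10.0.0.2"}
assert forward(pkt, flow_mode=True) == forward(pkt, flow_mode=False) == "nh-tunnel-1"
```

In flow mode an unknown flow is subject to the flow action (here simplified to a drop), whereas packet mode always goes straight to the next hop, e.g. a tunnel next hop that encapsulates and forwards the packet.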
Linux Kernel
In this deployment, vRouter is installed as a kernel module (vrouter.ko) inside the
Linux OS, as seen in Figure 3.5. This is the default installation mode when config-
uring a compute node. vRouter registers itself with the Linux TCP/IP stack to get
packets from any of the Linux interfaces that it wants to. It uses the netdev_rx_
handler_register() API provided by Linux for this purpose. The interfaces can be
bond, physical, tap (for VMs), veth (for containers) etc. It relies on Linux to send
and receive packets from different interfaces. For example, Linux exposes a tap
interface backed by a vhost-net driver to communicate with VMs. Once vRouter
registers for packets from this tap interface, the Linux stack sends all the packets
to it. To send a packet, vRouter just has to use regular Linux APIs like dev_queue_
xmit() to send the packets out on a Linux interface.
NIC queues (either physical or virtual) are handled by the Linux OS. With
respect to packet processing performance, tuning has to be done at the Linux
OS level. See Figure 3.6.
Figure 3.6 Kernel vRouter Interfaces with Other Components In the Compute Node
In Figure 3.6, packet processing works in interrupt mode. This mode generates
interrupts, which result in a lot of context switches. When the packet flow
rate is low, this works well, but as soon as the packet rate increases, the
system gets overwhelmed by the number of interrupts generated, resulting in
poor performance.
DPDK
In this mode, vRouter runs as a user space application that is linked to the DPDK
library. This is the performance version of vRouter that is commonly used by telcos,
where the VNFs themselves are DPDK-based applications. The performance of
vRouter in this mode is more than ten times higher than the kernel mode. The phys-
ical interface is used by DPDK’s poll mode drivers (PMDs) instead of Linux kernel’s
interrupt-based drivers.
A user-IO (UIO) kernel module such as vfio or uio_pci_generic is used to expose
a physical network interface's registers into user space so that they are
accessible by the DPDK PMD. When a NIC is bound to a UIO driver, it is moved
from Linux kernel space to user space and is therefore no longer managed by,
nor visible to, the Linux OS. Consequently, the DPDK application (here, the
vRouter) fully manages the NIC, including packet polling, processing, and
forwarding. No further action is taken by the operating system: all user packet
processing steps are performed by the vRouter DPDK data plane. See Figure 3.7.
The nature of this "polling mode" makes the vRouter DPDK data plane packet
processing and forwarding much more efficient than the interrupt mode when the
packet rate is high: there are no interrupts and no context switches during
packet IO.
NOTE When the network packet rate is low, this way of working can be less
efficient than the regular kernel mode. In DPDK mode, a set of CPUs is fully
dedicated to packet processing and keeps polling even in the absence of
packets. If the network packet rate is very low, a lot of CPU cycles are
wasted. However, an inbuilt optimization technique kicks in and yields the CPU
for a small amount of time when no packets arrived in the previous polling
interval.
Finally, since the DPDK vRouter cannot count on any help from the Linux
kernel, it needs to be carefully tuned to get the best packet processing
performance.
In this chapter we focus on the architecture of DPDK vRouter (see Figure 3.8).
Figure 3.8 DPDK vRouter Interfaces with Other Components In the Compute Node
SmartNIC
In this mode, the Contrail vRouter runs inside the NIC card itself (SmartNIC),
as shown in Figure 3.9. Compute host resources are therefore not involved in
packet processing, which saves the CPU resources that vRouter would otherwise
use. Since all the packet processing is done by the NIC hardware, the
performance is the best of the three deployment types.
The DPDK vRouter runs two groups of lcores:
Forwarding lcores
Service lcores:
tapdev lcore
timer lcore
uvhost lcore
packet (pkt0) lcore
netlink lcore
Forwarding lcores
Forwarding lcores are responsible for polling the physical and virtual
interfaces. A physical interface can also be a bonded interface. In addition,
forwarding lcores perform the vRouter packet processing, which is briefly
illustrated in this chapter's section "vRouter Packet Processing Pipeline."
These lcores can assume both the polling and the processing roles.
These lcores are spawned by the vRouter with a well-defined CPU list, given as
a mask (core mask) via the taskset Linux command:
taskset 0x1e0 /usr/bin/contrail-vrouter-dpdk --no-daemon
CPU number: 8 7 6 5 4 3 2 1 0
Bit value:  1 1 1 1 0 0 0 0 0
The 0x1e0 mask makes the vRouter spawn four forwarding lcores, pinned to CPU
numbers 5, 6, 7, and 8.
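Decoding a mask back into its CPU list is just bit testing. The helper below is a throwaway sketch, not vRouter code:

```python
def mask_to_cpus(mask: int):
    """Return the CPU numbers whose bits are set in an affinity mask."""
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

print(mask_to_cpus(0x1E0))  # [5, 6, 7, 8]
```

So the 0x1e0 mask (binary 1 1110 0000) selects four CPUs, one forwarding lcore per CPU.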
The first forwarding lcore is named lcore10, the next one lcore11, and so on.
Hence, if a DPDK vRouter has been configured with four forwarding lcores in
its CPU list, four pthreads are launched: lcore10, lcore11, lcore12, and
lcore13.
Here is output listing the threads running in vRouter, with their names and
PIDs:
[root@a7s4 ~]# ps -T -p $(pidof contrail-vrouter-dpdk)
PID SPID TTY TIME CMD
3685 3685 ? 03:47:37 contrail-vroute >>> Main thread and tuntap lcore
3685 3800 ? 00:04:32 eal-intr-thread >>> DPDK library control thread
3685 3801 ? 00:00:00 rte_mp_handle >>> DPDK library control thread
3685 3802 ? 04:55:48 lcore-slave-1 >>> Timer lcore
3685 3803 ? 00:00:02 lcore-slave-2 >>> uvhost lcore
3685 3804 ? 00:00:11 lcore-slave-8 >>> Packet (pkt0) lcore
3685 3805 ? 00:04:12 lcore-slave-9 >>> netlink lcore
3685 3806 ? 6-16:39:37 lcore-slave-10 >>> forwarding thread #1
3685 3807 ? 6-16:40:48 lcore-slave-11 >>> forwarding thread #2
105 DPDK vRouter Architecture
A DPDK application can follow one of three packet processing models:
Run-to-completion model
Pipeline model
Hybrid model
In the run-to-completion model, the software does not have multiple stages: it
does the entire processing in a single context, or single stage. There is no
inter-thread packet buffering, hence latency overheads are low.
In the pipeline model, the software is divided into multiple stages. Each
stage completes part of the processing and hands the packet over to the next
stage through a FIFO buffer. These buffers introduce latency, but the main
advantage of this model is that it ensures even load balancing across stages,
even when only a few stages are loaded.
Contrail vRouter uses a hybrid model: it employs the pipeline model in some
scenarios and the run-to-completion model in others. This ensures good load
balancing of all lcores with a reasonable latency, at the cost of the FIFO
buffers required by pipelining.
The vRouter performs the following types of packet processing:
Run-to-completion: a forwarding lcore polls packets from a vif Rx queue,
performs the vRouter packet processing, determines the encap/decap that needs
to be done, finds the outgoing vifs to which the modified packets need to be
sent, and finally sends them on those outgoing vif Tx queues.
Pipeline: a first forwarding lcore polls packets from a vif Rx queue, then
distributes these packets to other forwarding lcores through DPDK software
rings (FIFO buffers). The other forwarding lcores pick up the packets, perform
the packet processing, and send the modified packets to the outgoing vif Tx
queues.
vRouter uses the run-to-completion model in the following scenarios:
If the --vr_no_load_balance option is configured, all packets in any direction
are processed using this model.
Without that option, only packets coming from the physical NIC encapsulated in
MPLSoUDP or VXLAN are processed using this model.
Service lcores
Service lcores are responsible for tasks other than packet forwarding. They
handle all vRouter interface (vRouter port) tasks, such as port setup, binding
of workload interfaces to vRouter ports, routing information propagation, etc.
They also handle other book-keeping and miscellaneous tasks for vRouter, such
as timer management and the Virtio (vhost-user) control paths. By default,
these lcores are not pinned to any physical CPU.
Most of the service lcores use UNIX sockets to talk to other processes through
Inter-Process Communication (IPC). Some of the entities they communicate with
are the vRouter agent, qemu (virtual machines), and the Linux stack.
Tapdev lcore
The DPDK vRouter needs to be able to exchange packets with the Linux kernel
networking stack. The vhost0 interface is a network interface shared by both
the vRouter application and other Linux applications. For instance, on a
compute with a single network interface, the physical IP of the compute is
migrated onto the vhost0 interface. This IP is used by the SSH server daemon
and can't be migrated into the DPDK application (the vRouter data plane). See
Figure 3.11.
DPDK provides two solutions that allow user space applications to exchange
packets with the kernel networking stack:
Kernel NIC Interface (KNI)
tuntap interface
The vRouter implements a custom PMD for tuntap devices, which is used to send
and receive packets between the vRouter and the Linux host kernel. Currently,
the vhost0 and monitoring interfaces (used by the vifdump utility, explained
later) make use of it.
When a tap device is initialized, vRouter uses the tun driver (/dev/net/tun)
in Linux to create a tuntap device:
[root@a7s3 ~]# ethtool -i vhost0
driver: tun
version: 1.6
firmware-version:
expansion-rom-version:
bus-info: tap
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
Once the Netlink communication channel between the vRouter agent and vRouter
DPDK data plane has been set up using the Netlink lcore, the agent sends a mes-
sage to the vRouter DPDK to add the vhost0 interface, as shown in Figure 3.12.
As part of this sequence, a new vhost0 vif (vif0/1) is created and set up so
that the tapdev lcore is responsible for polling the vhost0 interface. vhost0
is the Linux network interface used by the vRouter agent to send XMPP packets
to Contrail control nodes.
In each iteration, the PMD uses raw “read” and “write” socket calls to receive and
transmit packets to the tuntap device.
One of the forwarding cores is assigned to process vhost0 packets and polls a
dedicated DPDK ring called tapdev_rx_ring. This ring is added to the
forwarding lcore's poll list when the vhost vif is added by the vRouter agent.
The tapdev PMD receives packets from the vhost0 interface using the read()
socket call and enqueues them to the above-mentioned DPDK ring, as shown in
Figure 3.13. One designated forwarding core, in most cases lcore10, picks up
these packets and processes them.
Timer lcore
The timer lcore is responsible for managing the timer list and executing the
timer callbacks of the different timers in vRouter. The timers include
internal DPDK library timers, for bonding for example, and vRouter timers, for
fragmentation, etc. The lcore executes a DPDK library API called
rte_timer_manage(), which runs the pending timers. The precision of the timers
depends on the call frequency of this function: the more often it is called,
the more CPU resources it uses. In the case of vRouter, the precision is 100us.
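The role of this lcore can be sketched as a loop that repeatedly asks a timer list which callbacks are due, the way rte_timer_manage() does inside DPDK. The sketch below uses simulated time in microseconds and invented timer names; it is not the DPDK API:

```python
import heapq

class TimerList:
    """Minimal timer-manage sketch; deadlines are in microseconds."""
    def __init__(self):
        self.heap = []  # (deadline_us, callback) min-heap

    def add(self, deadline_us, callback):
        heapq.heappush(self.heap, (deadline_us, callback))

    def manage(self, now_us):
        """Run every callback whose deadline has passed
        (the role of rte_timer_manage), returning their results."""
        fired = []
        while self.heap and self.heap[0][0] <= now_us:
            _, cb = heapq.heappop(self.heap)
            fired.append(cb())
        return fired

timers = TimerList()
timers.add(100, lambda: "bond-timer")  # e.g. a bonding timer
timers.add(250, lambda: "frag-timer")  # e.g. a fragmentation timer
assert timers.manage(100) == ["bond-timer"]
assert timers.manage(300) == ["frag-timer"]
```

Calling manage() every 100us, as the vRouter timer lcore effectively does, bounds how late a callback can fire to roughly that interval.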
Uvhost lcore
The uvhost lcore is responsible for handling the messages between qemu and the
vRouter, sometimes called the "vhost-user control channel." It handles both
cases: qemu as server and qemu as client. The communication goes through UNIX
sockets. Once the communication channel is established, the user space vhost
protocol takes place. During this message exchange, qemu and the vRouter
exchange information about the VM's memory regions, the virtio ring addresses,
and the supported features. At the end of the exchange, the virtio ring is
enabled and data communication between the VM and the vRouter can take place.
[2020-10-25 10:49:33]root@bcomp80:~
$ ps -eaf|grep 11736
root 11736 19907 99 Oct20 ? 39-14:31:44 /usr/bin/contrail-vrouter-dpdk --no-
daemon --vdev eth_bond_bond0,mode=4,xmit_policy=l34,socket_id=1,mac=14:02:ec:66:b8:dc,slave=0000:87:
00.0,slave=0000:09:00.1 --vlan_tci 722 --vlan_fwd_intf_name bond0 --socket-mem 1024 1024
In the above example, vRouter DPDK creates the UNIX socket named “uvh_vif_
tap3dd8d56d-ca”. This socket name is passed by the Contrail controller to the
vRouter agent. The agent then adds this virtual interface to the vRouter DPDK
using the Netlink channel. In parallel, qemu process is spawned by the nova plugin
which waits for this socket to be created. Once created by the vRouter process,
qemu then initiates a connection to this socket as a client. Hence the name “client
mode qemu”.
This is not the default mode, but it can be enabled in the configuration. It
is not preferred because of the "reconnect issue": when the vRouter DPDK
process is restarted, the VMs also need to be restarted to re-trigger the
vhost-user protocol.
Further, the virsh XML shows that the socket mode is “server”:
(nova-libvirt)[root@a7s4-kiran /]# virsh dumpxml 10 | grep server -B2 -A5
<interface type='vhostuser'>
<mac address='02:35:d2:a9:12:fe'/>
<source type='unix' path='/var/run/vrouter/uvh_vif_tap35d2a912-fe' mode='server'/>
<model type='virtio'/>
<driver queues='8'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
In the above example, libvirt (instead of vRouter DPDK) creates the UNIX
socket named "uvh_vif_tap3dd8d56d-ca" with the help of the Contrail nova
plugin. The socket name is then passed by the Contrail controller to the
vRouter agent. The agent then adds this virtual interface to the vRouter DPDK
using the Netlink channel. The vRouter then initiates a connection to this
socket as a client, with qemu being the server. Hence the name "server mode
qemu."
This is the default and preferred mode since it avoids the above-mentioned
"reconnect issue": when the vRouter DPDK process is restarted, it is the
responsibility of the vRouter to reconnect to qemu and trigger the vhost-user
protocol to set up the virtio data channel.
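The server/client distinction is only about which side creates and listens on the UNIX socket. The toy exchange below uses plain Python sockets, an invented socket path, and a single illustrative message name; it mimics "server mode qemu" (the listener stands in for qemu, the connecting side for the vRouter) and is not the vhost-user protocol itself:

```python
import os
import socket
import tempfile
import threading

# Hypothetical socket path, named after the uvh_vif_* convention.
path = os.path.join(tempfile.mkdtemp(), "uvh_vif_example")

# "Server mode qemu": qemu creates the socket and listens...
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(1)

def qemu_side():
    conn, _ = server.accept()
    conn.sendall(b"VHOST_USER_GET_FEATURES")  # one illustrative message
    conn.close()

t = threading.Thread(target=qemu_side)
t.start()

# ...and the vRouter connects as a client, so a restarted vRouter can
# simply reconnect and replay the vhost-user negotiation.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)
msg = client.recv(64)
t.join()
print(msg.decode())
```

In "client mode qemu" the roles are swapped: the vRouter binds and listens, and qemu connects, which is why a vRouter restart then forces the VMs to reconnect.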
The above command shows the two UNIX sockets that are used to send and receive
packets between the vRouter agent and vRouter DPDK. The first line of the output
shows the dpdk_pkt0 socket and is owned by vRouter DPDK process. The second
line of the output shows the agent_pkt0 socket and is owned by the vRouter agent
process.
When the agent wants to send a packet, it uses the send() system call to send
it to the UNIX socket dpdk_pkt0. Since the vRouter DPDK is listening on that
socket, its poll() system call returns and the packet is read using the read()
system call. The buffer that is read needs to be converted into the "mbuf"
structure, which DPDK understands. To accomplish this, a mempool called
"packet_mbuf_pool", holding a collection of mbufs, is created during
initialization. A new mbuf is allocated from this mempool and the buffer is
copied into it. The packet is then routed like a regular packet received on
the pkt0 (vif0/2) interface of vRouter. This processing happens in the context
of the packet lcore. See Figure 3.15.
In the reverse direction, the forwarding lcore enqueues the packet to the
"packet_tx" ring and wakes up the packet lcore using the event_fd that it
registered. The packet lcore then drains this ring and uses the send() system
call to send the packet to the agent through the agent_pkt0 socket.
Netlink lcore
The netlink lcore is responsible for establishing a communication channel with
the agent for programming the forwarding state (routes, next hops, labels,
etc.). See Figure 3.16.
The first line of the output shows the state as “LISTENING” for DPDK vRouter,
which indicates that it is a server and is waiting for clients such as agent to connect
to it.
The second line shows the agent connected to it and so the state is
“CONNECTED”.
The protocol carried on this socket is Netlink, which means all messages have
a Netlink header, 24 bytes in size, followed by the payload. The socket type
is UNIX. The Netlink header is comprised of the following:
Netlink message header
Generic Netlink message header
Netlink attribute
The header can be viewed easily by attaching gdb to the DPDK vRouter:
(gdb) ptype struct nlmsghdr Netlink message header
type = struct nlmsghdr {
unsigned int nlmsg_len;
unsigned short nlmsg_type;
unsigned short nlmsg_flags;
unsigned int nlmsg_seq;
unsigned int nlmsg_pid;
}
(gdb) ptype struct genlmsghdr Generic netlink message header
type = struct genlmsghdr {
__u8 cmd;
__u8 version;
__u16 reserved;
}
(gdb) ptype struct nlattr Netlink attribute
type = struct nlattr {
__u16 nla_len;
__u16 nla_type;
}
(gdb) p sizeof(struct nlmsghdr) + sizeof(struct genlmsghdr) + sizeof(struct nlattr)
$1 = 24
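The 24-byte figure can be reproduced by packing the three structures shown in the gdb output with Python's struct module. The layouts are transcribed directly from that output; this is a size check, not a working Netlink client:

```python
import struct

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid -> 16 bytes
NLMSGHDR = struct.Struct("=IHHII")
# struct genlmsghdr: u8 cmd, u8 version, u16 reserved -> 4 bytes
GENLMSGHDR = struct.Struct("=BBH")
# struct nlattr: u16 len, u16 type -> 4 bytes
NLATTR = struct.Struct("=HH")

total = NLMSGHDR.size + GENLMSGHDR.size + NLATTR.size
print(total)  # 24, matching the gdb computation above
```

The "=" prefix disables padding, which matches the naturally aligned C layouts here.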
The payload of this message is in the "Sandesh" format, a proprietary data
format (similar to XML) used by the agent and vRouter. In this format:
The object name specifies the type of object the message contains, like next
hop, route, MPLS, etc.
The type can be a fixed-length datatype like uint8, uint16, or uint32. It can
also be a variable-length datatype like "list," in which case a "length" field
specifies the length of the list.
These messages are parsed by an inbuilt parser, and the appropriate callbacks
are called depending on the object. For example, for a next hop object, the
next hop callback within vRouter is called, which in turn programs that next
hop in the next hop table.
If the vRouter needs to return a status or error message to the agent after
processing the Sandesh object, it can do so. That way the agent knows whether
the programming was successful.
Packets stored in vif RX rings are polled by a forwarding lcore. There is a
one-to-one mapping between forwarding cores and the NIC's Rx queues. The
polled packets are then processed by either the same lcore or a different one,
and pushed to a target vif's TX ring, as shown in Figure 3.18.
Each of the lcores numbered 10 and higher started by a DPDK vRouter is a
polling and processing thread. They run on the CPU list defined by the
CPU_LIST variable during provisioning.
Figure 3.19 Load Balancing in the Case of MPLS Over GRE Overlay
Consequently, all packets coming from VMs located on the same compute will be
received only in one DPDK RX ring of the vif0/0 interface (the vRouter interface
connected to the underlay network). So incoming MPLS GRE overlay packets will
not be well balanced onto the different forwarding cores.
The DPDK pipeline model is used to mitigate this: one lcore performs only packet polling while another lcore performs packet processing.
A hash algorithm is applied onto the decapsulated packet headers (inner packet) in
order to increase the entropy. As a result of this mechanism, packets are well bal-
anced across all the available forwarding lcores, even if the encapsulation is MPLS
over GRE.
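The effect of this inner-header hashing can be illustrated with a toy sketch. This is not the vRouter’s actual hash function (the cksum-based hash and the function name are ours, for demonstration only); it only shows the idea of spreading inner 5-tuples across the forwarding lcores:

```shell
# Toy illustration: hash an inner 5-tuple and map it onto one of N
# forwarding lcores. cksum stands in for the real hash algorithm.
pick_lcore() {
  # $1 = "src,dst,proto,sport,dport", $2 = number of forwarding lcores
  h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( h % $2 ))
}

pick_lcore "10.0.0.1,10.0.0.2,17,1000,2000" 4
pick_lcore "10.0.0.1,10.0.0.3,17,1000,2000" 4
```

Because the inner headers differ from flow to flow, different flows land on different lcores, which is the entropy increase described above.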
Figure 3.20 Load Balancing in the Case of MPLS Over UDP and VXLAN Overlays
It is more efficient to use UDP overlay protocols: with the same DPDK vRouter configuration, measured performance is higher when a UDP overlay protocol is chosen instead of MPLS over GRE.
NOTE You also have to take into consideration that the current DPDK vRouter is unable to correctly process a multi-queue vNIC configured with more queues than the number of forwarding threads.
Supported Scenarios
Contrail DPDK vRouter supports DPDK virtual machines as well as Linux kernel virtual machines. Likewise, a Contrail kernel vRouter supports both DPDK and non-DPDK virtual machines. See Figure 3.22.
However, only two scenarios really make sense:
Kernel mode vRouter supporting kernel mode virtual machines
DPDK vRouter supporting DPDK virtual machines
In the kernel scenario, both the VMs and the Contrail vRouter work with a regular Linux TCP/IP stack using interrupt mode packet processing. They suffer the same limitation (packet processing does not scale due to interrupt mode) and share the same advantage (no need to reserve many CPUs for packet processing). This scenario is best used when the VMs do not require high network performance.
In the DPDK scenario, both the VMs and the Contrail vRouter work with the DPDK library using poll mode packet processing. Both share the same trade-offs: poll mode dedicates some CPUs to packet processing, but it allows line rate forwarding. This scenario is best used when the VMs require high network performance, which is typically the case for Virtual Network Functions (VNFs).
Hybrid cases are unsuitable, but unavoidable in certain circumstances. Many VMs have both kernel and DPDK interfaces. Generally, kernel interfaces are used for management purposes and DPDK interfaces are used for data.
When a kernel mode VM is plugged into a DPDK vRouter, it impacts the whole Contrail vRouter and performance suffers. The DPDK vRouter has to emulate interrupt mode using KVM features in order to kick the VM. This involves a “VMExit,” which is like a system call to the hypervisor and costs many CPU cycles. This impacts not only the kernel mode VM but all the other DPDK VMs as well.
A DPDK VM plugged into a Contrail kernel mode vRouter is also very inefficient. Even if the VM is able to process its network packets at a very high speed, the Linux kernel packet processing used by the kernel mode vRouter does not scale well. So, in the end, many packets generated by a high speed VNF plugged into a Contrail kernel mode vRouter can be lost.
This is why Contrail users have to be consistent and plug the data interfaces of DPDK VMs into a DPDK vRouter and the data interfaces of kernel mode VMs into a kernel mode vRouter.
When a virtual infrastructure is made up of several kinds of VMs (both DPDK and non-DPDK), a placement strategy has to be defined in order to spawn DPDK VMs on computes fitted with a Contrail DPDK vRouter and non-DPDK VMs on computes fitted with a Contrail kernel mode vRouter.
Chapter 4
Defining the huge pages memory to be used by the DPDK vRouter to create
vRouter interface DPDK rings for physical and VM NICs.
Configuring the number of queues of DPDK vRouter physical and VM NICs. Queues are configured automatically with a one-to-one mapping. On physical NICs, the vRouter configures as many queues as the number of allocated polling cores. For each VM NIC, the vRouter binds each queue to a single polling core. That means the vRouter provides a one-to-one polling core/queue mapping as long as the number of VM queues does not exceed the number of cores allocated to the vRouter.
In CentOS or Red Hat Enterprise Linux, the Contrail vRouter DPDK-specific setup is defined in the /etc/sysconfig/network-scripts/ifcfg-vhost0 configuration file. To activate changes, the vRouter agent vhost0 network interface has to be recreated so that the modified setup is enforced:
$ sudo ifdown vhost0
$ sudo ifup vhost0
124 Chapter 4: Contrail DPDK vRouter Setup
Using the following command, you can display the PCI identifiers of the physical interfaces available in the Linux OS:
$ sudo lshw -class network | grep pci@
bus info: pci@0000:02:01.0
bus info: pci@0000:02:02.0
bus info: pci@0000:03:00.0
Once the Contrail DPDK vRouter has been started, you can see the actual physical
interfaces used for the underlay network interconnection:
$ sudo docker exec contrail-vrouter-agent-dpdk /opt/contrail/bin/dpdk_nic_bind.py -s
Network devices using DPDK-compatible driver
============================================
0000:02:01.0 '82540EM Gigabit Ethernet Controller' drv=uio_pci_generic unused=e1000
0000:02:02.0 '82540EM Gigabit Ethernet Controller' drv=uio_pci_generic unused=e1000
Network devices using kernel driver
===================================
0000:03:00.0 'Virtio network device' if= drv=virtio-pci unused=virtio_pci,uio_pci_generic
Other network devices
=====================
<none>
Some CPUs for the VMs. Generally, this is the main purpose of your virtual infrastructure.
Some CPUs for the vRouter high-speed packet processing (polling, processing, and forwarding steps).
This section only considers servers with a NUMA architecture and hyper-threading enabled. The term CPU will be used to mean the pair of logical cores (main core and its sibling) created on each physical core, while the term core will be used to mean a single logical core. In this section, each CPU is therefore made up of two cores (physical and its sibling). It is also assumed that a containerized version of OpenStack and Contrail is being used. In Figure 4.1, the virtual infrastructure architect starts by defining the number of CPUs to be allocated to each group as described.
In order to get the best performance, the CPUs allocated to VMs and to the vRouter have to be isolated from those kept for the Linux OS. CPU isolation is the first setup step: it defines the CPUs that will no longer be used by the Linux OS. Those CPUs will be dedicated to the DPDK vRouter or used by OpenStack Nova to spawn VMs.
Figure 4.2 shows the CPU core topology of a two-socket system with 2 x 12 physical cores (hyper-threading enabled). This topology is used in the configuration examples provided in the following sections.
NUMA node0 CPU(s):
PHY cores: 0 2 4 6 8 10 12 14 16 18 20 22
HT cores : 24 26 28 30 32 34 36 38 40 42 44 46
NUMA node1 CPU(s):
PHY cores: 1 3 5 7 9 11 13 15 17 19 21 23
HT cores : 25 27 29 31 33 35 37 39 41 43 45 47
It is possible to remove some CPUs from the Linux scheduler using the isolcpus kernel parameter. This kernel parameter has to be provisioned at system startup: the GRUB configuration is updated to define the isolcpus parameter, and the system is then restarted.
The next example keeps only CPUs 0, 1, 24, and 25 for the Linux OS by excluding them from the isolcpus list. It is strongly recommended to keep at least the first CPU (main core and its sibling) on each NUMA node:
$ vi /etc/default/grub
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_
hugepagesz=1GB hugepagesz=1G hugepages=28 iommu=pt intel_iommu=on isolcpus=2-23,26-47"
$ grub2-mkconfig -o /etc/grub2.cfg
You also need to specify the CPUs that have to be used by the Linux OS in the systemd configuration file (this step is unnecessary when Red Hat OS is used with tuned as described next):
$ vi /etc/systemd/system.conf
CPUAffinity=0-1,24-25
$ sudo systemctl daemon-reexec
$ sudo systemctl restart system.slice
When tuned is used, the CPUAffinity value will automatically be overwritten with the CPUs that are not listed in isolated_cores. This is important in order to keep enough CPUs for the Linux OS. Non-isolated CPUs are used by all tasks started and managed by the Linux OS scheduler:
System configuration and control tasks
NOTE The mask for CPUs 2,4,6,8,26,28,30,32 maps to the binary value
b0000 0000 0000 0001 0101 0100 0000 0000 0000 0001 0101 0100
(0x154000154).
NOTE The mask for the allocated CPUs 10,34 maps to the binary value
b0000 0000 0000 0100 0000 0000 0000 0000 0000 0100 0000 0000
(0x400000400).
NOTE These parameters can use two different syntaxes: mask or list, the same as
for CPU_LIST.
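The mask/list correspondence in the notes above can be checked with a small helper (a sketch; the function name is ours):

```shell
# Convert a comma-separated CPU list into the equivalent hexadecimal
# mask by setting one bit per listed CPU.
cpulist_to_mask() {
  mask=0
  old_ifs=$IFS; IFS=','
  for cpu in $1; do
    mask=$(( mask | (1 << cpu) ))
  done
  IFS=$old_ifs
  printf '0x%x\n' "$mask"
}

cpulist_to_mask "2,4,6,8,26,28,30,32"   # prints 0x154000154
cpulist_to_mask "10,34"                 # prints 0x400000400
```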
In order to get these changes taken into consideration, the Nova compute service
must be restarted:
$ sudo docker restart nova_compute
Some of these huge pages to be allocated for the vRouter physical NIC
The huge pages allocated to be visible from both the DPDK vRouter application and the VMs
The DPDK vRouter detects the huge page hugetlbfs mount point. Here, the DPDK vRouter will try to use 1GB huge pages. If no page size is specified, the DPDK vRouter assumes 2MB huge pages have to be used. If no huge pages of the specified size (or 2MB if no size is specified) are available, the Contrail DPDK vRouter will fail to start.
The amount of huge page memory requested by the vRouter at startup for its physical NIC DPDK ring setup is specified in the socket-mem parameter. In order for the vRouter to request huge page memory only on the first NUMA socket, this option is used with a single value:
--socket-mem <value>
In order for the vRouter to request huge page memory on both the NUMA0 and NUMA1 sockets, the option is used with two values:
--socket-mem <value>,<value>
It is important to allocate huge page memory to all NUMA nodes that will have
DPDK interfaces associated with them. If memory is not allocated on a NUMA
node associated with a physical NIC or VM, they cannot be used. If you are using
two or more ports from different NICs, it is best to ensure that these NICs are on
the same CPU socket.
Here we are configuring the vRouter to request 1GB huge pages memory on both
NUMA nodes:
$ vi /etc/sysconfig/network-scripts/ifcfg-vhost0
DPDK_COMMAND_ADDITIONAL_ARGS="--socket-mem 1024,1024"
$ sudo ifdown vhost0
$ sudo ifup vhost0
The following parameters are used for DPDK vRouter physical NIC
configuration:
--vr_mempool_sz: this is used to define the mempool size (number of packet buffers). Default value is 16384.
--dpdk_txd_sz: this is used to define the physical NIC TX ring descriptor count. Default value is 256.
--dpdk_rxd_sz: this is used to define the physical NIC RX ring descriptor count. Default value is 256.
The following formula has to be used to define the mempool size:
--vr_mempool_sz = 2 * (dpdk_txd_sz + dpdk_rxd_sz) * number_of_vrouter_cores *
number_of_ports_in_dpdk_bond
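Plugging in 512 RX and TX descriptors, eight vRouter cores, and two bond ports, the formula works out as follows:

```shell
# Worked example of the mempool sizing formula above.
dpdk_txd_sz=512
dpdk_rxd_sz=512
vrouter_cores=8
ports_in_bond=2
vr_mempool_sz=$(( 2 * (dpdk_txd_sz + dpdk_rxd_sz) * vrouter_cores * ports_in_bond ))
echo "$vr_mempool_sz"   # prints 32768
```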
Next we are configuring the vRouter physical NIC DPDK rings with 512 RX and TX descriptors, eight cores, and two ports in a bond. Based on the formula, the mempool size should be 32768:
$ vi /etc/sysconfig/network-scripts/ifcfg-vhost0
DPDK_COMMAND_ADDITIONAL_ARGS="--dpdk_rxd_sz 512 --dpdk_txd_sz 512 --vr_mempool_sz 32768"
$ sudo ifdown vhost0
$ sudo ifup vhost0
NOTE Physical NIC DPDK ring size modification can lead to unexpected side effects (packet loss). The mempool size needed depends on the configured maximum packet size (physical NIC MTU), the number of NICs in the physical bond, and the configured number of RX and TX descriptors.
Two parameters are used for the DPDK vRouter internal queue (software rings)
configuration:
--vr_dpdk_tx_ring_sz: this is used to define the forwarding lcore TX ring descriptor count (1024 by default).
--vr_dpdk_rx_ring_sz: this is used to define the forwarding lcore RX ring descriptor count (1024 by default).
Here we are configuring the vRouter internal rings with 2048 RX and TX
descriptors:
$ vi /etc/sysconfig/network-scripts/ifcfg-vhost0
DPDK_COMMAND_ADDITIONAL_ARGS="--vr_dpdk_rx_ring_sz 2048 --vr_dpdk_tx_ring_sz 2048"
$ sudo ifdown vhost0
$ sudo ifup vhost0
In order to get these changes taken into consideration, Nova compute service has
to be restarted:
$ sudo docker restart nova_compute
The VM NIC and the vRouter vif to which each interface is connected share the same queues (DPDK rings):
A vRouter vif TX ring is the same as the virtual NIC RX ring it is connected to.
A vRouter vif RX ring is the same as the virtual NIC TX ring it is connected to.
This avoids duplicating the same information and the processing overhead that would otherwise be needed to copy data between the vRouter vif and the virtual machine queues.
133 Virtual Machine vif Multiqueue Setup
This is why VM NIC queues have to be accessible from both the vRouter and the VM they belong to. VMs have to be created by the QEMU/KVM hypervisor with a specific property that allows them to access the host OS huge pages and to request huge page allocations.
The huge page size to be allocated by the hypervisor to the VM has to be specified with hw:mem_page_size. The configured huge page size must be the same as the one used by the DPDK vRouter (defined by the huge page size of the hugetlbfs mount point).
Here we are configuring an OpenStack flavor named m1.large, which defines 1GB huge pages in the hw:mem_page_size property:
$ openstack flavor set m1.large --property hw:mem_page_size=1GB
When an instance is started with multiqueue vif property enabled, each interface is
automatically configured with several queues. The number of queues to be enabled
on each interface is automatically defined by Nova.
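Multiqueue is enabled through the Nova image property hw_vif_multiqueue_enabled. As a sketch (the image name is a placeholder):

```shell
# Enable multiqueue for instances booted from this image; each vNIC then
# gets multiple queues as described above.
openstack image set my-vm-image --property hw_vif_multiqueue_enabled=true
```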
If the compute host (hypervisor node running qemu/kvm) is running Linux Kernel
3.X, the number of queues configured on the VM interface is the same as the num-
ber of virtual CPUs configured on the VM, but can’t exceed eight queues. That
means for a VM configured with ten vCPUs, all its virtual network interface cards
will be configured with eight queues when multiqueue is enabled.
If the compute host (hypervisor node running qemu/kvm) is running Linux Kernel
4.X, the number of queues configured on the VM interface is the same as the num-
ber of virtual CPUs configured on the VM but can’t exceed 256 queues. That
means for a VM configured with ten vCPUs, all its virtual network interface cards
will be configured with ten queues when multiqueue is enabled.
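The queue-count rule for both kernel generations can be sketched as a small helper (the function name is ours; it assumes the count is simply the vCPU count capped at the kernel limit):

```shell
# Number of queues Nova configures per VM NIC: equal to the vCPU count,
# capped at 8 on 3.x host kernels and 256 on 4.x host kernels.
vm_nic_queues() {
  # $1 = number of vCPUs, $2 = host kernel major version
  cap=8
  if [ "$2" -ge 4 ]; then cap=256; fi
  if [ "$1" -lt "$cap" ]; then echo "$1"; else echo "$cap"; fi
}

vm_nic_queues 10 3   # prints 8
vm_nic_queues 10 4   # prints 10
```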
As explained earlier, the Contrail vRouter is not able to process packets from connected virtual network interface cards configured with more queues than the number of CPUs defined in its CPU_LIST (the number of polling and processing cores defined on the Contrail vRouter).
Consequently, a Contrail vRouter configured with only four polling and processing cores won’t be able to serve a VM configured with ten vCPUs if it is connected with the vif multiqueue property enabled.
One of the following changes has to be performed:
disable multiqueue on the VM
add more polling and processing cores on the vRouter (increase to eight cores
instead of only four)
decrease the number of queues configured by Nova on the VM
In order to override their default values, you can configure an updated value using the DPDK_COMMAND_ADDITIONAL_ARGS parameter defined in the vhost0 DPDK vRouter configuration file. For instance, we can decrease the next hop table size to 32K instead of the 512K configured by default:
$ vi /etc/sysconfig/network-scripts/ifcfg-vhost0
DPDK_COMMAND_ADDITIONAL_ARGS="--vr_nexthops=32768"
$ sudo ifdown vhost0
$ sudo ifup vhost0
All these parameters can increase vRouter performance but can also have a negative impact when not properly configured.
The previous chapters have gone through most of the important topics about SDN
and about DPDK in general: DPDK vRouter architectures, vRouter packet pro-
cessing details, and so on. When you read about these topics you may wonder how
to get a running Contrail networking environment with a few DPDK vRouters in
it, so you can play around, test those theories, and familiarize yourself with what
you’ve learned. Indeed, those topics are important, but unfortunately they are not
so straightforward. So even after we’ve put great effort into illustrating how they
work, some of them may still sound confusing, especially when you get down to
the implementation details.
This chapter and Chapter Six focus mostly on hands-on lab testing to verify some
of the most important DPDK vRouter concepts and working mechanisms.
In this chapter, we’ll begin by introducing the steps we used to install the latest version of a Contrail networking cluster.
On top of that, we’ll build a testing environment. That includes a few VMs running the OPNFV PROX software. On each VM, based on its role, the PROX software is configured as either a traffic generator or a traffic receiver.
In Chapter Six we’ll introduce some of the commonly used DPDK tools, scripts, and log entries that provide useful information to help you understand how things run in a DPDK environment.
At the end of Chapter Six, we’ll go over some case studies. We’ll use the PROX and rapid tools we’ve installed to start different traffic patterns in the setup, and then use DPDK tools to analyze what we see.
After reading these two chapters, you will have a deeper understanding of some of
the main concepts covered in this book. Let’s start with Contrail installation.
Contrail Installation
Installation Methods
We’ve been focusing on the DPDK vRouter that runs in each individual compute node, which basically runs in a relatively standalone mode. But if you look at the forwarding plane as a whole, those nodes actually form a distributed system. In fact, as discussed in Chapter 1, the whole TF cluster is a complex distributed system involving many more software modules, especially in the control plane. Again, each of these software modules can be a completely different distributed system by itself.
The Cassandra database that the TF cluster uses is one such example. Explaining and understanding details about how things work in a distributed system is never easy, nor is the installation process. Don’t be surprised if you run into some installation issues in your lab. Generally speaking, it is always much more efficient to follow a detailed, verified process with step-by-step instructions to avoid issues, rather than starting in a try-and-see mode and then trying to fix issues.
Currently, the TF cluster has been widely integrated with many major deployment systems and platforms. Therefore, depending on your environment, there can be totally different ways of installing the Contrail system. Here is an incomplete list of currently supported installation methods:
Installing a Nested Red Hat OpenShift Container Platform 3.11 Cluster Using
Contrail Ansible Deployer
For example, in the second method listed above, you can install Contrail with Red
Hat OpenStack platform director 13 (RHOSPd), which is a toolset based on the
OpenStack project TripleO (OOO, or OpenStack on OpenStack). A TF
140 Chapter 5: Contrail Networking and Test Tools Installation
bond members
$ cat ifcfg-bond0
SUBCHANNELS=1,2,3
NM_CONTROLLED=no
BOOTPROTO=none
BONDING_OPTS="miimon=100 mode=802.3ad xmit_hash_policy=layer3+4"
DEVICE=bond0
BONDING_MASTER=yes
ONBOOT=yes
$ cat ifcfg-enp2s0f0
HWADDR=00:1b:21:bb:f9:46
SLAVE=yes
NM_CONTROLLED=no
BOOTPROTO=none
MASTER=bond0
DEVICE=enp2s0f0
ONBOOT=yes
$ cat ifcfg-bond0.101
HWADDR=00:1b:21:bb:f9:46
SLAVE=yes
NM_CONTROLLED=no
BOOTPROTO=none
MASTER=bond0
DEVICE=enp2s0f0
ONBOOT=yes
$ cat ifcfg-enp2s0f1
HWADDR=00:1b:21:bb:f9:47
SLAVE=yes
NM_CONTROLLED=no
BOOTPROTO=none
MASTER=bond0
DEVICE=enp2s0f1
ONBOOT=yes
Once the restart is successful, you should see the bond0 interface appearing on all nodes, with one of the IP addresses 8.0.0.1 to 8.0.0.4 on each node. Now you should have IP connectivity in both the management network and the fabric network.
Next you’ll need to install Ansible and use it to automate the rest of the installation. Most of Ansible’s magic is performed through its playbooks, and configuration for all playbooks is done in a single file with the default name instances.yaml. This configuration file has multiple main sections. We’ll go over some of the main parameters in this file and then introduce the steps to run the playbooks. Here is the instances.yaml configuration file:
1 global_configuration:
2 CONTAINER_REGISTRY: svl-artifactory.juniper.net/contrail-nightly
3 REGISTRY_PRIVATE_INSECURE: True
4 provider_config:
5 bms:
6 ssh_pwd: c0ntrail123
7 ssh_user: root
8 ntpserver: 10.84.5.100
9 domainsuffix: englab.juniper.net
10 instances:
11 a7s2:
12 provider: bms
13 ip: 10.84.27.2
14 roles:
15 openstack_control:
16 openstack_network:
17 openstack_storage:
18 openstack_monitoring:
19 config_database:
20 config:
21 control:
22 analytics_database:
23 analytics:
24 webui:
25 a7s3:
26 provider: bms
27 ip: 10.84.27.3
28 ssh_user: root
29 ssh_pwd: c0ntrail123
30 roles:
31 openstack_compute:
32 vrouter:
33 PHYSICAL_INTERFACE: bond0.101
34 CPU_CORE_MASK: 0x1fe
35 DPDK_UIO_DRIVER: uio_pci_generic
36 HUGE_PAGES: 32000
37 AGENT_MODE: dpdk
38 a7s4:
39 provider: bms
40 ip: 10.84.27.4
41 ssh_user: root
42 ssh_pwd: c0ntrail123
43 roles:
44 openstack_compute:
45 vrouter:
46 PHYSICAL_INTERFACE: bond0.101
47 CPU_CORE_MASK: 0x1fe
48 DPDK_UIO_DRIVER: uio_pci_generic
49 HUGE_PAGES: 32000
50 AGENT_MODE: dpdk
51 a7s5:
52 provider: bms
53 ip: 10.84.27.5
54 ssh_user: root
55 ssh_pwd: c0ntrail123
56 roles:
57 openstack_compute:
58 vrouter:
59 PHYSICAL_INTERFACE: bond0.101
60 contrail_configuration:
61 CONTRAIL_VERSION: 2008.108
62 OPENSTACK_VERSION: rocky
63 CLOUD_ORCHESTRATOR: openstack
64 CONTROLLER_NODES: 8.0.0.1
65 OPENSTACK_NODES: 8.0.0.1
66 CONTROL_NODES: 8.0.0.1
67 KEYSTONE_AUTH_HOST: 8.0.0.200
68 KEYSTONE_AUTH_ADMIN_PASSWORD: c0ntrail123
69 RABBITMQ_NODE_PORT: 5673
70 KEYSTONE_AUTH_URL_VERSION: /v3
71 IPFABRIC_SERVICE_IP: 8.0.0.200
72 VROUTER_GATEWAY: 8.0.0.254
143 Contrail Installation
73 two_interface: true
74 ENCAP_PRIORITY: VXLAN,MPLSoUDP,MPLSoGRE
75 AUTH_MODE: keystone
76 CONFIG_API_VIP: 10.84.27.51
77 ssh_user: root
78 ssh_pwd: c0ntrail123
79 METADATA_PROXY_SECRET: c0ntrail123
80 CONFIG_NODEMGR__DEFAULTS__minimum_diskGB: 2
81 CONFIG_DATABASE_NODEMGR__DEFAULTS__minimum_diskGB: 2
82 DATABASE_NODEMGR__DEFAULTS__minimum_diskGB: 2
83 XMPP_SSL_ENABLE: no
84 LOG_LEVEL: SYS_DEBUG
85 AAA_MODE: rbac
86 kolla_config:
87 kolla_globals:
88 kolla_internal_vip_address: 8.0.0.200
89 kolla_external_vip_address: 10.84.27.51
90 contrail_api_interface_address: 8.0.0.1
91 keepalived_virtual_router_id: 111
92 enable_haproxy: yes
93 enable_ironic: no
94 enable_swift: no
95 kolla_passwords:
96 keystone_admin_password: c0ntrail123
97 metadata_secret: c0ntrail123
98 keystone_admin_password: c0ntrail123
line 3: set to True when the containers are pulled from an insecure private registry (the one named in CONTAINER_REGISTRY)
line 4-9: provider-specific settings
line 10-59: instances means the nodes on which the containers will be launched. Here we define four nodes, named a7s2, a7s3, a7s4, and a7s5, respectively.
line 11-24: this is the configuration section for node a7s2
line 12-14: this server’s provider type (baremetal server), ip address, and roles
line 14-24: roles of the containers that will be installed on this node. According to the configuration, server a7s2 will be installed with all “controller” software modules of both OpenStack and Contrail.
line 25-37: parameters for our first DPDK compute node. OpenStack compute components and the Contrail vRouter will be installed on it.
line 51-59: defines the third vRouter. This one is kernel-based, so we don’t need any DPDK-specific parameters.
line 60-85: contrail_configuration section contains parameters for Contrail
services
line 61-62: Contrail and OpenStack versions
line 63: the cloud orchestrator. It can be OpenStack or vCenter. Our setup is with OpenStack only.
line 64-66: which node is the controller node. In our setup both the OpenStack and Contrail controllers are installed on the same node.
line 71, 76: these are the two virtual IPs configured
line 80-82: these are needed only for a lab setup. Without these parameters, the contrail-status command will print a warning indicating that the storage space is not big enough.
line 86-98: the parameters for Kolla
line 88-89: VIPs configured for the management and data/control networks, respectively. One use of these VIPs is to make the OpenStack Horizon service (web UI) accessible from the management network; by default all OpenStack services listen on an IP in the data/control network. With these VIPs configured and managed by keepalived, HAProxy can forward access requests coming from the management network to the Horizon service.
Installation Steps
Once the YAML file is carefully prepared, the installation process is relatively easy. Basically, you select one node as the deployment node: the node from which you automate the installation of all the other nodes. In practice, use the controller node as the deployment node.
On this node you need to install some prerequisite software, such as Python libraries, Ansible, and git, plus the Python modules (python-wheel) used by Ansible. Ansible is our automation tool, and git is used to clone the GitHub repository that includes all the Ansible playbooks. Then you use Ansible to automate the software installation on all the nodes based on the playbooks and your instances.yaml configuration file. The details start here:
1. Install the prerequisite packages on the deployment node; in this case, it’s the controller a7s2:
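The exact package list depends on your distribution; on CentOS 7, a minimal sketch might look like this (package names and versions are assumptions, not the book’s verified list):

```shell
# Prerequisites on the deployment node: git, Python tooling, and Ansible.
sudo yum install -y epel-release
sudo yum install -y git python python-pip
sudo pip install ansible wheel
```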
2. Use git to clone the Ansible deployer repository onto the deployment node:
git clone http://github.com/tungstenfabric/tf-ansible-deployer
cd tf-ansible-deployer
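The playbooks are then run against your instances.yaml. A sketch based on the tf-ansible-deployer workflow (verify the playbook names against the repository’s README for your version):

```shell
# Copy the prepared configuration where the deployer expects it, then run
# the playbooks in order: provision the nodes, install OpenStack (Kolla),
# and install Contrail.
cp instances.yaml config/instances.yaml
ansible-playbook -i inventory/ playbooks/configure_instances.yml
ansible-playbook -i inventory/ -e orchestrator=openstack playbooks/install_openstack.yml
ansible-playbook -i inventory/ -e orchestrator=openstack playbooks/install_contrail.yml
```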
After everything completes, you will have an up-and-running four-node Contrail cluster (one controller node and three vRouter/compute nodes). You can log in to the setup via the web UI or an SSH session to check system status.
Post-installation Verification
Figure 5.2 shows the Contrail web UI for a working setup.
You can also log in to each individual node with SSH and run the contrail-status command to verify the running status of all the components, as shown in Figure 5.3.
If everything works, congratulations! You now have your own lab to play in. Now let’s go over the steps of setting up the testing tools used to send and receive traffic: the PROX and rapid scripts.
147 DPDK vRouter Test Tools
PROX
PROX (Packet pROcessing eXecution engine) is an OPNFV project application built on top of DPDK. It is capable of performing various operations on packets in a highly configurable manner. It also supports performance statistics that can be used for performance investigations. Thanks to its rich feature set, it can be used to create flexible software architectures through small and readable configuration files. This section introduces how to use it to test vRouter performance in DPDK environments.
In a typical test you need two VMs running PROX. VM1 generates packets and sends them to VM2, which performs a swap operation on all packets so that they are sent back to VM1.
traffic generator VM (gen VM)
traffic swapping VM (swap VM)
This book calls them gen and swap VM, respectively. One special feature used here
is that the swap PROX is configured in such a way that once it receives the packets
sent from the generator, it will swap, or loop, them back to the generator VM so
the latter can collect them and calculate how much traffic was forwarded by the
DUT - in this case it’s the DPDK vRouter.
Rapid
Rapid (Rapid Automated Performance Indication for Data Plane) is a group of
wrapper scripts interacting with PROX to simplify and automate the configura-
tion of PROX. It’s a set of files and scripts offering an even easier way to do sanity
checks of the data plane performance.
Rapid is both powerful and configurable. A typical workflow works as follows:
A script named runrapid.py sends the proper configuration files to the gen and swap VMs involved in the testing, so each one knows its role (generator or swapper) in the test.
It then starts PROX within both VMs, as generator and swapper, respectively.
While the test is going on, it collects the results from PROX. Results are visible on screen and logged in log and CSV files.
The same tests are repeated for different packet sizes and different numbers of flows, under given latency and packet drop rate constraints.
The rapid scripts are typically installed in a third VM, called the jump VM in this book. The purpose of this VM is to control the traffic generator (start, stop, and pause the test) as well as to collect statistics.
The test setup consists of three compute nodes running the three VMs mentioned, respectively:
PROX generator VM runs on compute-A: this is the traffic generator VM.
PROX looping VM runs on compute-B: this is the swap VM, which loops traffic back out of the same interface where it came in. This is the DUT (device under test) where the vRouter is running.
Rapid jump VM runs on compute-C: this is the VM where the rapid scripts are installed; it is responsible for controlling the traffic generator and collecting results.
Hardware Requirements
Here’s a brief summary of hardware requirements for different VMs:
Swap VM: This is where the DUT (vRouter) is located. Based on the test requirements, a specific amount of hardware resources should be allocated, and all applications that could unnecessarily consume those resources should be removed.
Gen VM: In order to saturate the DUT, the traffic generator VM and its compute should be allocated much more CPU resources than the DUT.
Jump VM: No high speed VM is required, and it can run on either a kernel or a DPDK compute.
Optionally, the generator and receiver computes can use a bonded interface configured in 802.3ad LACP mode. This is a common configuration recommended in practical environments.
The PROX setup in this book’s lab is shown in Figure 5.5.
By default, multi-queue is enabled on both the PROX gen and swap VMs via an OpenStack image property. You can refer to Chapter 3 for more details about the multi-queue feature and its configuration. Additionally, the rapid scripts also provide CPU pinning to protect the PROX PMDs against CPU stealing by other processes and the VM OS.
Installation
Creating OpenStack Resources
As mentioned earlier, to perform the test we need two VMs, both running PROX: one sending traffic and the other receiving it and swapping it back. The exact same PROX application runs on both, just with different configuration files.
Obviously, IP-level connectivity is required for the two VMs to exchange packets with each other. In this case, the two VMs will be spawned by OpenStack Nova. Needless to say, all supporting objects and resources associated with the VMs, like IPAM, subnet, virtual network, and VM flavor (size of CPU/memory/storage, etc.), also need to be created in the OpenStack infrastructure, either from the Horizon web UI or the OpenStack CLI. A quick list of common tasks:
create IPAMs/subnets/virtual networks
create flavors
create images
create instances
create key-pairs
On top of this, installing PROX inside the VMs, as with many other open-source projects, requires downloading the source code and compiling it on your platform: you download the PROX source code, compile it to get the executable, then configure and run the application. In this section we show how PROX was installed in the setup we built for this book.
MORE? You can find more details on the PROX website: https://wiki.opnfv.org/display/SAM/PROX+installation.
In our lab setup the VM OS is the same as the host's, and the emulated CPU model is Intel Xeon E3-12xx:
[root@stack2-gen ~]# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)
NOTE There is a good chance that your servers and VMs have totally different hardware and software architectures. The steps here were tested and work fine in the book's lab setup, but depending on your environment, you may run into some errors. Check the PROX online documentation for more detailed instructions.
NOTE The stable and recommended version of DPDK at the time of writing this
book is 19.11.
Compiling PROX
Now with the DPDK libraries built, let’s start to download, extract, and build the
PROX application. Here are the steps:
git clone https://github.com/opnfv/samplevnf
cd samplevnf/VNFs/DPPD-PROX
git checkout origin/master
make
When make succeeds, the compiled prox binary will be available in the build folder of the current directory. (We'll demonstrate this shortly.)
Configuration Files
A set of sample configuration files can be found in the ./config folder. Sample configs for PROX functioning as a generator are available in the ./gen/ folder. Assuming the current directory is where you've just built PROX, you can launch PROX with a proper configuration file:
./build/prox -f <prox configuration file>
When it runs, an ncurses-based UI will pop up, and through it you will see updates about the running state in real time. We'll give an example shortly.
Rapid Installation
Rapid scripts can be downloaded from https://github.com/opnfv/samplevnf/tree/master/VNFs/DPPD-PROX/helper-scripts/rapid. The scripts are written in Python, so you can run them directly; there is nothing to compile.
Repeating these manual steps becomes tedious and even painful. To simplify building, creating, and configuring PROX, as well as creating all the necessary OpenStack resources, the number one choice for automation is Heat. With Heat, all tasks are typically programmed in a template file, which reads its parameters from an environment file. On the GitHub site for this book we provide sample template files, an environment file, and associated scripts that are tested and proven to work, at least in our setup. You can use them as a starting point, then make the necessary customizations for your environment to build your own automation. The VM where the tools run, including the rapid scripts and the pre-compiled PROX DPDK application, has also been built as an image.
With all these automations carefully designed and tested, what we need to do now becomes much simpler:
download the pre-built image and load it into the OpenStack image service
If everything goes well, you will have your whole PROX testing environment available in just a few minutes. The detailed steps are listed below:
1. Prepare the pre-built VM image, heat template files, and scripts:
VM image: this is the image with PROX compiled, as shown in the previous section.
Adjust the heat template, environment variables, and automation scripts based on your environment (these files are available at https://github.com/pinggit/dpdk-contrail-book):
environment.yaml
build-rapid.yml
configure.rapid.sh
Wait a few minutes and use the openstack stack list command to check the stack creation progress, shown in Figure 5.6.
5. Once loaded, you can use the different sub-commands (see Figure 5.7) of the openstack stack command to retrieve the parameters of the stack components (see Figure 5.8):
openstack stack list STACK
openstack stack resource list
openstack stack resource list --filter type=OS::Nova::Server
openstack stack show STACK
openstack stack output show STACK
In our test we didn't configure any floating IP, so we will use the console and the meta_data_ip address to access the VM. To access the VM console, use the virsh console command from the nova_libvirt docker container on the compute node:
[root@a7s3 ~]# docker exec -it nova_libvirt virsh list
Id Name State
----------------------------------------------------
2 instance-00000041 running
[root@a7s3 ~]# docker exec -it nova_libvirt virsh console 2
Connected to domain instance-00000041
Escape character is ^]
CentOS Linux 7 (Core)
Kernel 3.10.0-1062.18.1.el7.x86_64 on an x86_64
stack2-gen login: root
Password:
Last login: Fri Sep 25 17:31:21 from 192.168.0.2
[root@stack2-gen ~]#
Compared with the console, an SSH session is usually preferred. Let's take a look at each VM's allocated interface IPs with the openstack server list command (as seen in Figure 5.9):
Good, vif0/3 has the IP, so this vif connects to the tap interface of our jump VM.
In Contrail vRouter, for each vif there is also a hidden meta_data_ip of 169.254.0.N, where N is the same number as in the interface name vif0/N. Therefore, in this case, the meta_data_ip is 169.254.0.3. Let's try to start an SSH session to it:
[root@a7s5-kiran ~]# ssh 169.254.0.3
Password:
It works. The benefit of this approach is that not only is interaction with the VM much faster, it also supports file copies with the scp tool. Remember, in many cases the VM does not have any internet connection, so when you need to copy files into (or out of) the VM, the meta_data_ip method is especially useful.
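Since the mapping is purely mechanical, it is easy to script. The helper below is ours, not a Contrail tool; it derives the metadata IP from a vif name of the form vif0/N:

```shell
# Illustrative helper (not part of Contrail): derive the hidden metadata IP
# 169.254.0.N from a vif name of the form vif0/N.
vif_meta_ip() {
    n="${1#vif0/}"            # strip the "vif0/" prefix, leaving N
    echo "169.254.0.${n}"
}

vif_meta_ip vif0/3            # prints 169.254.0.3
```

With a helper like this you can loop over vif --list output and ssh or scp to each VM without ever looking up a tap interface IP.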
This is a symbolic link; by default the rapid folder links to /opt/openstackrapid/samplevnf/VNFs/DPPD-PROX/helper-scripts/rapid/.
This starts the rapid script and sends traffic for ten seconds by default. The traffic duration can be adjusted with the --runtime option:
cd /root/prox/helper-scripts/rapid/
./runrapid.py --runtime <time> # replace <time> with time per one execution in seconds
A few other command line options are supported; they can be listed with -h:
[root@stack2-jump rapid]# ./runrapid.py -h
usage: runrapid [--version] [-v]
[--env ENVIRONMENT_NAME]
[--test TEST_NAME]
[--map MACHINE_MAP_FILE]
[--runtime TIME_FOR_TEST]
[--configonly False|True]
[--log DEBUG|INFO|WARNING|ERROR|CRITICAL]
[-h] [--help]
Command-line interface to runrapid
optional arguments:
-v, --version Show program's version number and exit
--env ENVIRONMENT_NAME Parameters will be read from ENVIRONMENT_NAME. Default is rapid.env.
--test TEST_NAME Test cases will be read from TEST_NAME. Default is basicrapid.test.
--map MACHINE_MAP_FILE Machine mapping will be read from MACHINE_MAP_FILE. Default is machine.
map.
--runtime Specify time in seconds for 1 test run
--configonly If this option is specified, only upload all config files to the VMs, do not
run the tests
--log Specify logging level for log file output, default is DEBUG
--screenlog Specify logging level for screen output, default is INFO
-h, --help Show help message and exit.
You can see that some preparation work is done before the actual test starts:
1. The script reads three files: rapid.env, basicrapid.test, and machine.map. The env file provides the IP/MAC information of the gen and swap VMs, and the .test file defines the detailed behavior of the test.
2. The script connects to both the gen and swap VMs.
3. The script starts a small amount of traffic as a warmup, to test reachability between the source and destination, and to populate the MAC and ARP tables in the devices along the path.
4. When everything is ready, the script starts the traffic and at the same time monitors the receiving rate in real time. A packet drop rate higher than the defined threshold indicates that the current traffic rate is too high for the DUT, so the script lowers the rate in the next iteration. By binary search, it eventually finds the maximum throughput between the two systems within the allowed packet loss and accuracy, which are defined in the *.test files (for example, the basicrapid.test file for a simple test).
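The binary search in the last step can be sketched in a few lines of shell. This is a simplified illustration, not rapid's actual code: measure_loss stands in for a real traffic measurement and here simply simulates a DUT that starts dropping packets above 6.4 Mpps:

```shell
# Simulated measurement: prints 1 (loss) above the DUT's capacity, else 0.
measure_loss() { [ "$1" -gt 6400000 ] && echo 1 || echo 0; }

lo=0; hi=10000000                      # search window in pps
while [ $((hi - lo)) -gt 100000 ]; do  # stop at 0.1 Mpps accuracy
    rate=$(( (lo + hi) / 2 ))
    if [ "$(measure_loss "$rate")" -eq 0 ]; then
        lo=$rate                       # no loss: try a higher rate
    else
        hi=$rate                       # loss: back off
    fi
done
echo "max lossless rate ~ ${lo} pps"
```

In rapid, the accuracy and the allowed loss threshold come from the *.test file rather than being hard-coded.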
The script is highly configurable. The GitHub site for this book provides the sample basicrapid.test used in our lab. You can start with it and fine-tune it based on your needs. For example, in section [test2] of the file you can change the number of flows and the packet sizes to define different test scenarios:
[test2]
test=flowsizetest
packetsizes=[64,256,512,1024,1500]
# the number of flows in the list need to be powers of 2, max 2^20
# Select from following numbers: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576
flows=[16384, 65536]
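A candidate flow count can be checked against the power-of-two constraint with a one-line helper (ours, for illustration):

```shell
# Illustrative check: a positive integer n is a power of two iff n & (n-1) == 0.
is_pow2() { [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]; }

is_pow2 16384 && echo "16384 ok"       # a valid flow count
is_pow2 1000  || echo "1000 rejected"  # not a power of two
```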
PROX parses its configuration file /root/gen.cfg and starts to boot. From the boot messages on the screen you can see its boot sequence:
set up the DPDK environment (RTE EAL)
initialize tasks
start the test and display an ncurses-based text UI
You will end up with an ncurses-based UI like the one shown in Figure 5.11.
The display shows per-task statistics that include estimated idleness; per-second statistics for packets received, transmitted, or dropped; per-core cache occupancy; cycles per packet; and so on. These statistics can help pinpoint bottlenecks in the system, and the information can then be used to optimize the configuration. There are quite a few other features, including debugging support, scripting, Open vSwitch support, and more; refer to the PROX website for details. For now, let's look at the swap side.
You will end up with a similar ncurses-based text UI, after a boot process similar to the sender's. Once the swap end of PROX is up and running, you will immediately see both the RX and TX counters (Figure 5.13) keep updating on both sides of the traffic.
That concludes our discussion of PROX and rapid as testing tools. We'll use these tools extensively in the rest of this book to generate different kinds of traffic for each test. With the traffic running, you can dig deeper to understand how vRouter works. Now let's introduce some commonly used tools that are designed for, or especially useful for, verification in the DPDK vRouter environment.
Chapter 6: Contrail DPDK vRouter Toolbox and Case Study
Now you are inside the container. From here you can run all the old vRouter tools you may be familiar with, for example, printing the packet drop statistics:
(contrail-tools)[root@a7s3 /]$ dropstats | grep -iEv "0$|^$"
Flow Action Drop 1792
Flow Queue Limit Exceeded 305
Invalid NH 12
No L2 Route 1
We use grep to remove all counters with a zero value. When you are done, just exit and the container will be killed:
(contrail-tools)[root@a7s3 /]$ exit
exit
[root@a7s3 ~]#
You can also pass the tool command as a parameter to the script, and it will execute the command, print its output, and exit the container, all in one go:
[root@a7s3 ~]# contrail-tools dropstats | grep -iE route
No L2 Route 68129939
[root@a7s3 ~]#
163 DPDK vRouter Tool Box
As of this writing, there are nearly twenty tools available in this container. Let’s
take a look at what’s in the package.
First, in the container locate the package name:
[root@a7s3 ~]# contrail-tools
(contrail-tools)[root@a7s3 /]$ rpm -qa | grep contrail-tool
contrail-tools-2008-108.el7.x86_64
Then, based on the package name, you can list all available tools in it:
(contrail-tools)[root@a7s3 /]$ repoquery -l contrail-tools-2008-108.el7.x86_64 | grep bin
/usr/bin/dpdkinfo
/usr/bin/dpdkvifstats.py
/usr/bin/dropstats
/usr/bin/flow
/usr/bin/mirror
/usr/bin/mpls
/usr/bin/nh
/usr/bin/pkt_droplog.py
/usr/bin/qosmap
/usr/bin/rt
/usr/bin/sandump
/usr/bin/vif
/usr/bin/vifdump
/usr/bin/vrfstats
/usr/bin/vrftable
/usr/bin/vrinfo
/usr/bin/vrmemstats
/usr/bin/vrouter
/usr/bin/vxlan
In previous chapters you've read about the dpdk_nic_bind.py script, a tool that binds a specific driver to a NIC. In the rest of this section, we'll introduce some more tools that are especially useful in the DPDK environment.
vhost0 interface
However, some of the most important interfaces are not shown at all:
The physical fabric interface: the bond interface in our setup
If you compare this with what you'd see from the same ip command on a kernel-mode vRouter compute without DPDK, there's a big difference:
[root@a7s5-kiran ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:47:d7:b4 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:47:d7:b5 brd ff:ff:ff:ff:ff:ff
4: enp2s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_
UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
5: enp2s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_
UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_
UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
12: docker0: <NO-
CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:d6:c6:2c:12 brd ff:ff:ff:ff:ff:ff
41: pkt1: <UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void c2:6e:97:ef:cd:b2 brd 00:00:00:00:00:00
42: pkt3: <UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void 8e:44:4e:2e:28:0c brd 00:00:00:00:00:00
43: pkt2: <UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void a6:2a:01:7c:db:65 brd 00:00:00:00:00:00
44: vhost0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_
fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
45: bond0.101@bond0: <BROADCAST,MULTICAST,UP,LOWER_
UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
46: pkt0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_
fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 5e:a0:f8:77:25:97 brd ff:ff:ff:ff:ff:ff
49: tap0160123b-14: <BROADCAST,MULTICAST,UP,LOWER_
UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether fe:01:60:12:3b:14 brd ff:ff:ff:ff:ff:ff
Here, except for the lo and management interfaces and whatever we saw on the DPDK compute, you can also see these other important interfaces:
bond interface and its sub-interface: bond0, bond0.101
NOTE The pkt1, pkt2, and pkt3 interfaces are created by vRouter but not used in a DPDK setup.
The reason you see these differences, as we've mentioned many times throughout this book, is that when DPDK is in charge of the NIC, the Linux kernel is mostly bypassed. The NIC's features and functions are exposed by a special driver directly connected to the user-space PMD driver running in the DPDK layer, so traditional applications that rely on interfaces sitting in the Linux kernel to do their job are no longer useful.
We'll talk more about this later. For now, let's look at the vif command with the -l|--list and --get options. The vif --list command lists all interfaces in the vRouter, and --get retrieves just one of them. Here is the capture from the same DPDK compute:
......
vif0/3 PMD: tap41a9ab05-64 NH: 32
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:3 Mcast Vrf:3 Flags:PL3L2DMonEr QOS:-1 Ref:12
RX queue packets:2306654691 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:2306869103 bytes:285898139558 errors:0
TX packets:47613036 bytes:5739655392 errors:0
ISID: 0 Bmac: 02:41:a9:ab:05:64
vif0/2 Socket: unix
Type:Agent HWaddr:00:00:5e:00:01:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3Er QOS:-1 Ref:3
RX port packets:71548 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:71548 bytes:6153128 errors:0
TX packets:14936 bytes:1359697 errors:0
Drops:0
vif0/3: this is the vRouter interface connecting to the data interface of our PROX VM: tap41a9ab05-64
vif0/4: this is the vRouter interface connecting to the control and management interface of our PROX VM: tapd2d7bb67-c1
Now you should understand the importance of the vif command, especially in the DPDK vRouter. It shows the interfaces from vRouter's perspective, and reveals the one-to-one mapping between the vRouter interfaces and the fabric or VM tap interfaces, which would otherwise be invisible.
Besides that, it also displays key information. The vrf numbers and packet counters are the most commonly used data points. Among the various counters we usually focus on RX/TX packets/bytes, which show the data received or sent in packets or bytes. Depending on your environment, you may also see non-zero numbers in the RX/TX queue packets/errors counters, which give inter-lcore packet statistics; this usually happens when two lcores are involved in the packet forwarding path. We'll use this command intensively in the rest of this chapter and analyze the lcores to understand some of the important vRouter working mechanisms.
The vif tool also supports some other options such as --help to display a brief list of
all currently supported options:
[root@a7s3 ~]# contrail-tools vif --help
Usage: vif [--create <intf_name> --mac <mac>]
[--add <intf_name> --mac <mac> --vrf <vrf>
--type [vhost|agent|physical|virtual|monitoring]
--transport [eth|pmd|virtual|socket]
--xconnect <physical interface name>
--policy, --vhost-phys, --dhcp-enable]
--vif <vif ID> --id <intf_id> --pmd --pci]
[--delete <intf_id>|<intf_name>]
[--get <intf_id>][--kernel][--core <core number>][--rate] [--get-drop-stats]
[--set <intf_id> --vlan <vlan_id> --vrf <vrf_id>]
[--list][--core <core number>][--rate]
[--sock-dir <sock dir>]
[--clear][--id <intf_id>][--core <core_number>]
[--help]
We won't cover every option; usually you don't need any except --get and -l|--list. There is one more, --add, which we'll talk about shortly. The --clear option resets all counters, which is handy for setting a quick, clean baseline for later observations; we'll give an example later.
MORE? For other options, refer to the Juniper documentation at: https://www.
juniper.net/documentation/en_US/contrail20/topics/task/configuration/vrouter-cli-
utilities-vnc.html.
Now let's look at two useful scripts developed on top of the vif command: dpdkvifstats.py and vifdump.
dpdkvifstats.py Script
We've seen that the vif command displays all interfaces and their traffic statistics (RX/TX packets/bytes/errors, RX queue packets/errors, etc.) in the form of a list. During testing or troubleshooting, you can collect this data to evaluate the vRouter's forwarding performance, its running status, whether it is losing packets, and so on. In production, you always need to examine the traffic passing through a compute. It's the same in a lab: once you start traffic from PROX or any other traffic generator, the first thing to check is the traffic rate on the interfaces.
168 Chapter 6: Contrail DPDK vRouter Toolbox and Case Study
Starting from R2008, a Python script named dpdkvifstats.py is provided. It collects the statistics from vif output, calculates the rate of change of all counters in pps (packets per second) and bps (bits per second), for both per-lcore and total statistics, and displays the result in a table. This makes the output prettier and comparison across vif interfaces much easier.
In fact, the vif command also provides --list --rate options to display traffic rates. However, it lacks itemized per-lcore statistics and its display is not easy to collect into a file.
To demonstrate how the script works (see Figure 6.1), in our testbed we configured PROX to send traffic at a constant speed of 125000 bytes per second (Bps) with a minimum packet size of 60 bytes. That works out to about 1.48K packets per second.
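That figure can be sanity-checked, assuming the configured 125000 Bps counts full wire occupancy (our reading, since it matches the quoted rate): each minimum-size 60-byte frame also occupies 4 bytes of FCS plus 20 bytes of preamble and inter-frame gap on the wire, 84 bytes in total:

```shell
# 125000 Bps of 60-byte packets: each frame takes 84 bytes of line time
# (60 headers + payload, 4 FCS, 20 preamble + inter-frame gap).
Bps=125000; pkt=60; wire=$((pkt + 4 + 20))
pps=$((Bps / wire))
echo "${pps} pps"                     # 1488 pps, roughly 1.48 Kpps
```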
We typically run the script twice: first to show the traffic rate for vif0/3 (-v), then to show the traffic rate for all (-a) vif interfaces for comparison. In both executions, per-lcore statistics for a specific interface are given separately. With the -v option, the total for the interface is also given, which is the sum of the counters from all cores; this gives per-interface statistics. With -a, the script also calculates the RX/TX/RX+TX traffic rate for each lcore across all interfaces at the end; this gives the overall forwarding load of each lcore in the DPDK vRouter.
To understand the output, let's first review the DPDK vRouter CPU core allocation. In Chapter 3, you learned about the DPDK vRouter architecture and how its packet processing works. Basically, vRouter creates the same number of lcores and DPDK queues as the number of CPUs allocated to it.
For testing purposes, in this compute we've allocated two CPU cores to the vRouter DPDK forwarding lcores. CPU allocation to the forwarding lcores is configurable via the vRouter configuration files; refer to Chapter 4 for the details. For each vRouter interface, two DPDK queues are created, each served by a forwarding lcore in the DPDK process. That is why in the output each vif interface has two lines of statistics, for Core 1 and Core 2, respectively.
This capture shows that vRouter interface vif0/3 processed 1501 pps of traffic on the first forwarding lcore, that is, 720640 bits per second at a 60-byte packet size. The majority of the traffic is forwarded out of the fabric interface vif0/0, at a similar rate of 1512 pps. The overlay packets received from the VM are tunneled with extra underlay encapsulation, MPLSoUDP in this case, so the vif0/0 bps number (1282176) is a little bigger than the number on the VM interface.
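The gap between the two bps figures matches the MPLSoUDP encapsulation overhead. A quick check, assuming standard outer headers and a single 4-byte MPLS label:

```shell
# Each 60-byte overlay packet gains an outer Ethernet (14) + IPv4 (20) +
# UDP (8) + MPLS label (4) = 46 bytes of underlay encapsulation.
inner=60
outer=$((inner + 14 + 20 + 8 + 4))    # 106 bytes on the fabric side
echo "$((1512 * outer * 8)) bps"      # 1512 pps * 106 bytes = 1282176 bps
```

The result matches the vif0/0 rate reported by the script exactly.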
This script conveniently gives a straightforward overview of the current traffic profile from vRouter's perspective. To compare with the original vif output on which the script is based, let's check what the raw data looks like without the dpdkvifstats.py script:
[root@a7s3 ~]# vif --clear; sleep 1; vif --get 3 --core 10; vif --get 3 --core 11
......
vif0/3 PMD: tap41a9ab05-64 NH: 34
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:2 Mcast Vrf:2 Flags:PL3L2DEr QOS:-1 Ref:12
Core 11 RX queue packets:1496 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Core 11 RX packets:0 bytes:0 errors:0
Core 11 TX packets:0 bytes:0 errors:0
ISID: 0 Bmac: 02:41:a9:ab:05:64
Drops:131
We captured the interface data, waited one second, and then captured it again. We can then calculate the difference in each counter between the two captures to get that counter's rate of increase.
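The arithmetic that dpdkvifstats.py automates can be sketched as follows. The sed pattern matches the vif output format above; the two samples are hard-coded for illustration, since the real command only runs on a vRouter node:

```shell
# Extract the "RX queue packets" counter from a vif output line, then diff
# two samples taken one second apart to get a per-second rate.
rx_pkts() { echo "$1" | sed -n 's/.*RX queue packets:\([0-9]*\).*/\1/p'; }

t0=$(rx_pkts "Core 11 RX queue packets:1496 errors:0")
t1=$(rx_pkts "Core 11 RX queue packets:2997 errors:0")
echo "$((t1 - t0)) pps"               # 1501 pps over the one-second interval
```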
vifdump Script
On most Linux machines, tcpdump comes with the OS as part of a standard package. With it you can capture whatever packets are seen by a NIC, whether a physical NIC or a virtual one like a tuntap interface; both are visible to the kernel. In a DPDK environment, the interfaces are not visible to the kernel, which makes tcpdump unworkable, unless you just want it to read packets from a file.
Fortunately, we now know that each interface related to the vRouter data plane connects to a unique vRouter interface (vif). We can make use of this fact to create an alternative. vifdump is a shell script that, when invoked, uses the --add option of the vif command to create a monitoring tun interface in the Linux kernel; internally, the vRouter clones all data passing through the monitored vif interface to this kernel interface. vifdump then starts the tcpdump program to capture packets from the monitoring tun interface. From a user's perspective, the script works the same way as tcpdump. Here are two captures, one on vif0/3 toward the VM (our PROX gen), and one on vif0/0 toward the fabric interface:
[root@a7s3 ~]# contrail-tools vifdump -i 3 -n -c 3
vif0/3 PMD: tap41a9ab05-64 NH: 32
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mon3, link-type EN10MB (Ethernet), capture size 262144 bytes
13:12:31.286528 IP 192.168.1.104.filenet-cm > 192.168.1.105.filenet-nch: UDP, length 82
13:12:31.286532 IP 192.168.1.104.filenet-rmi > 192.168.1.105.filenet-pch: UDP, length 82
13:12:31.286540 IP 192.168.1.104.filenet-rpc > 192.168.1.105.filenet-pa: UDP, length 82
3 packets captured
401 packets received by filter
271 packets dropped by kernel
vifdump: deleting vif 4348...
dpdkvifstats.py and vifdump are two scripts built on top of the vif command. With these tools we can collect general packet RX/TX counters and packet contents. In the next section, we'll take a look at another powerful debug tool that is useful in the DPDK environment: dpdkinfo.
dpdkinfo Command
We’ve talked about vif and dpdkvifstats.py tools. Now let’s introduce a relatively
new tool that can be used to investigate lower level details of DPDK interfaces. It’s
dpdkinfo and it was introduced in Contrail 2008. With it, Contrail operators can
collect more information about DPDK vRouter fabric interface internal status,
connectivity (physical NIC bond), DPDK library information, and some other
statistics.
Let’s first run the tool with -h to get a brief menu of it:
(contrail-tools)[root@a7s3 /]$ dpdkinfo -h
Usage: dpdkinfo
--help
--version|-v Show DPDK Version
--bond|-b Show Master/Slave bond information
--lacp|-l <all/conf> Show LACP information from DPDK
--mempool|-m <all/<mempool-name>> Show Mempool information
--stats|-n <eth> Show Stats information
--xstats|-x <=all/=0(Master)/=1(Slave(0))/=2(Slave(1))>
Show Extended Stats information
--lcore|-c Show Lcore information
--app|-a Show App information
Optional: --buffsz <value> Send output buffer size (less than 1000Mb)
From this help information we can see that it provides information about the DPDK interface in multiple areas. In the rest of this section, let's take a look at some of the most useful options:
--version|-v
--bond|-b
--lacp|-l
--stats|-n
--xstats|-x
--lcore|-c
There are some other options, like --app|-a and --mempool|-m, that we won't introduce in this book, and the list of supported functions may grow in future releases. But you will get the basic idea of the tool's usage and can refer to the official documents for up-to-date information.
version
The -v or --version option reports the basic version information of the DPDK release in use:
(contrail-tools)[root@a7s3 /]$ dpdkinfo -v
DPDK Version: DPDK 19.11.0
vRouter version: {build-info: [{build-time: 2020-09-04 10:38:22.330666, build-
hostname: 6fb64a1f86b9, build-user: root, build-version: 2004}]}
The command output also displays each member link's information: its current driver, MAC address, up/down status, and so on. Since LACP is running, LACP parameters for each member link are displayed. Another way to show this information is with the -l|--lacp option:
[root@a7s3 ~]# contrail-tools dpdkinfo -l all
LACP Rate: slow
Here, you can get more insight into the LACP running stats, including all LACP timers and PDU statistics about the number of packets exchanged with the peer device. Of course, these counters are for LACP PDUs only. If you need all packets received and sent through the bond interface, use the -n|--stats option.
dpdkvifstats.py -v X
The DPDK bond interface is represented by the vRouter interface vif0/0, so you may think that setting X to 0 in the above commands achieves the same effect. The problem is that none of these tools print packet statistics for each member link of the bond. Let's take a look at an example:
[root@a7s3 ~]# contrail-tools dpdkinfo --stats eth
Master Info:
RX Device Packets:28360664, Bytes:3233321316, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:28361174, Bytes:3234763122, Errors:0
Queue Rx: [0]28360664
Tx: [0]28361174
Rx Bytes: [0]3233321316
Tx Bytes: [0]3234760294
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.0):
RX Device Packets:1421, Bytes:129257, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:28358167, Bytes:3234235595, Errors:0
Slave Info(0000:02:00.1):
RX Device Packets:28359275, Bytes:3233195707, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:3039, Bytes:531175, Errors:0
Queue Rx: [0]28359275
Tx: [0]3039
Rx Bytes: [0]3233195707
Tx Bytes: [0]531175
Errors:
---------------------------------------------------------------------
Now, start the rapid script to send 64 flows, and check the same dpdkinfo command output again:
[root@a7s3 ~]# contrail-tools dpdkinfo -n eth
Master Info:
RX Device Packets:471211, Bytes:53724144, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:471189, Bytes:53719798, Errors:0
Queue Rx: [0]471211
Tx: [0]471190
Rx Bytes: [0]53724144
Tx Bytes: [0]53719884
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.0):
RX Device Packets:228370, Bytes:26033818, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:220073, Bytes:25090326, Errors:0
Queue Rx: [0]228370
Tx: [0]220076
Rx Bytes: [0]26033818
Tx Bytes: [0]25090640
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.1):
RX Device Packets:242872, Bytes:27693860, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:251148, Bytes:28633120, Errors:0
Queue Rx: [0]242872
Tx: [0]251158
Rx Bytes: [0]27693860
Tx Bytes: [0]28634260
Errors:
---------------------------------------------------------------------
From the member link packet statistics, you can see that the traffic is balanced across both links.
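You can quantify "balanced" by comparing the per-slave TX packet counters from the capture above:

```shell
# Per-slave TX packet counters taken from the dpdkinfo capture above.
s0=220073; s1=251148
total=$((s0 + s1))
echo "slave0 $((100 * s0 / total))%, slave1 $((100 * s1 / total))%"
```

A roughly even split (here about 46% vs 53%) is what you want to see from a healthy 802.3ad bond under a multi-flow load.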
Now you understand that the --stats|-n option provides insight into member link usage through a few RX/TX counters. Based on this information you can determine the load-balancing status of a DPDK bond interface. So far, all of the packet counters we've seen, whether under the master or the members, are almost the same as those provided by the vif command. In practice, if you need more extensive statistics, there is another option: --xstats|-x. Let's check it out:
[root@a7s3 ~]# contrail-tools dpdkinfo -xall | grep -v ": 0"
Master Info:
Rx Packets: Rx Bytes:
rx_good_packets: 852475379 rx_good_bytes: 97185979648
rx_q0packets: 852475379 rx_q0bytes: 97185979648
Tx Packets: Tx Bytes:
tx_good_packets: 852853117 tx_good_bytes: 97253818091
tx_q0packets: 852853127 tx_q0bytes: 97253769503
Errors:
Others:
------------------------------------------------------------------
lcore
There are several key concepts we’ve been trying to illustrate in this book. Among
others, at least three of them are often mentioned together: lcore, interface, and
queue, but before we start introducing the fourth, the -c|--lcore option, let’s briefly
review these concepts:
lcore: lcore is a thread in vRouter DPDK process running in user space.
interface: This is the endpoint of connections between the vRouter and the
other VM, or between vRouter and the outside of the compute. At the vRouter
and VM end, the interfaces are called vif and tap interfaces, respectively. There
are also bond0 physical interface in DPDK user space and vhost0 interface in
Linux kernel. The former is the physically NIC bundle connecting to the peer
device, and the latter gives the host an IP address and through which the
vRouter agent can exchange control plane messages with the controller.
Queue:For each interface there are some queues created. They are essentially
allocated memory to hold the packets.
The CPU cores connect all these objects together. As of this writing, the implementation has a one-to-one mapping between the number of CPU cores allocated to vRouter and the number of interface queues. For example, if four CPUs are allocated to the DPDK vRouter forwarding threads (the lcores), then four lcores will be created, and four DPDK interface queues will be created for each vif interface. The same rule applies to the VM: assign four CPU cores to a VM and, by default, Nova will create four queues for the tap interface in the VM. Of course, the multiqueue feature needs to be turned on in Nova. We can illustrate this with Table 6.1.
This is just a simple example. In production deployments there are many more conditions to consider, and a lot of confusion arises. Common questions are:
What if the tap interface queue number is different than the vif queue number?
What will happen when there are eight lcores but one of our VMs is running four queues in its tap interface?
Will vif0/3 queue0 always be served by lcore0 instead of other lcores? If not, how do we determine which vif queue goes to which lcore? Is there a chance that an imbalanced lcore-to-queue mapping occurs, leaving some lcores overloaded while others sit relatively idle?
To answer these questions, we need a tool that reveals the actual mapping between lcores and the queues of the different vif interfaces. This is the moment for the -c|--lcore option of dpdkinfo to show its power. Again, let's start with an example:
[root@a7s3 ~]# contrail-tools dpdkinfo -c
No. of forwarding lcores: 2
No. of interfaces: 4
Lcore 0:
Interface: bond0.101 Queue ID: 0
Interface: vhost0 Queue ID: 0
Lcore 1:
Interface: bond0.101 Queue ID: 1
Interface: tap41a9ab05-64 Queue ID: 0
Let’s start from the first line. In this example, we have allocated two CPU cores to
the DPDK vRouter forwarding lcores, so we have two forwarding lcores running
in total.
Then, the second line gives the number of vRouter interfaces in the compute. We have four in total: one vif0/3 connecting to the VM tap interface tap41a9ab05-64, and three mandatory ones, vif0/0, vif0/1, and vif0/2, connecting to bond, vhost0, and pkt0, respectively. Here we have created just one VM (this is nothing but the PROX gen VM we created earlier) with only one tap interface.
Let's focus on the third line onward. The output lists all forwarding lcores currently configured in vRouter, and for each lcore it lists the interfaces that the lcore is associated with, in other words, the interfaces this lcore is serving.
Please note that there are some inconsistencies in lcore numbering between the different tools:
In the dpdkvifstats.py script, forwarding lcore numbering starts from one, so Core 1 refers to the first forwarding lcore.
In dpdkinfo -c output, forwarding lcore numbering starts from zero, so Lcore 0 refers to the first forwarding lcore.
In vif output, forwarding lcore numbering starts from ten, so --core 10 refers to the first forwarding lcore.
This can cause confusion in our discussions. To stay consistent, in the rest of this chapter we'll use the first forwarding lcore, fwd lcore#10, or simply lcore#10; the second forwarding lcore, fwd lcore#11, or simply lcore#11; and so on, to indicate Lcore 0 and Lcore 1 in dpdkinfo -c output, Core 1 and Core 2 in dpdkvifstats.py output, and Core 10 and Core 11 in vif output, respectively. Here's a better visualization:
Okay, as you may have realized, we use just one queue on the VM interface, which means the multiqueue feature on the VM interface is not enabled. Therefore the VM tap interface has only one queue connecting to its peering vRouter interface. Correspondingly, only one queue in the vRouter interface is needed, and only one lcore is required to serve packet forwarding on the vif interface.
First, let's look at the bond0 and vhost0 interfaces. The bond0 interface bundles the physical interfaces and always has multiple queues enabled; that is why it has two queues and both lcores serve it. The vhost0 interface is a control plane Linux interface. At the time of writing, the implementation hard-codes vhost0 with one queue only, and the first forwarding thread, lcore#10, got it. This is not the focus of this section, but it helps to understand the whole output.
Finally, let's look at the last line, the VM tap interface. From the output you can see that the second forwarding lcore (lcore#11) is assigned to this VM interface. You're probably wondering whether it was chosen randomly out of the two lcores or by some algorithm. It is not random. Currently the allocation follows a simple method: the least used lcore, in terms of the number of interface queues it is serving, is assigned to serve the next interface queue. Based on what we just explained, lcore#10 took two interface queues (bond0.101 and vhost0) while lcore#11 took just one (bond0.101), so it was lcore#11's turn to take the next interface and queue.
In short, the vNIC queues are assigned to lcores by the following algorithm: the forwarding core that is currently polling the fewest queues is selected, with ties won by the lowest-numbered core (the first forwarding core, lcore#10). You'll see more examples in later sections, where we'll test the tie breaker and other cases. For now, the mapping looks like Table 6.2.
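The assignment method above can be sketched in a few lines of Python. This is an illustrative model, not the actual vRouter code; the lcore numbers follow our fwd lcore#10/lcore#11 notation:

```python
# Sketch of the queue-to-lcore assignment described above: each new
# interface queue goes to the forwarding lcore currently serving the
# fewest queues; ties are won by the lowest-numbered lcore.
def assign_queue(lcore_load):
    """lcore_load: dict mapping lcore id -> number of queues it serves."""
    best = min(sorted(lcore_load), key=lambda c: lcore_load[c])
    lcore_load[best] += 1
    return best

load = {10: 0, 11: 0}            # two forwarding lcores
print(assign_queue(load))        # bond0.101 q0 -> 10 (tie, lowest wins)
print(assign_queue(load))        # bond0.101 q1 -> 11
print(assign_queue(load))        # vhost0 q0    -> 10
print(assign_queue(load))        # VM tap q0    -> 11 (least used)
```

Running the sketch reproduces the mapping we observed in the dpdkinfo -c output: the VM tap queue lands on the second forwarding lcore.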
Now that we've gone through the dpdkinfo command and demonstrated some of its most commonly used options, you can quickly display a lot of useful information about DPDK and the DPDK vRouter's running status. We'll revisit this later in our test case studies. This information is important before working on any deployment or troubleshooting task in the setup. However, when things go wrong, instead of relying only on the DPDK command output, you may also want to check the log messages to verify that the current running status is what you expected it to be.
Next we’ll take a look at DPDK vRouter log messages.
This log file contains lots of good information that helps you understand the current running status. Of course, understanding the log messages is also important during a troubleshooting process.
You can see the complete list of vRouter start up parameters on this Contrail
vRouter, for example:
build-version 2008
it’s running DPDK Version 19.11.0
We can compare this information with what we can display with these command
line tools and see if they are consistent:
contrail-version
dpdkinfo -v
vrouter --info
taskset
NOTE In service threads, lcore3 to lcore7 are never used in Contrail DPDK
vRouter.
Here, the string --lcores defines the mapping for the service and forwarding threads. Following it are a few coupled numbers connected by @ (NUMBER@NUMBER) and separated by commas. How do we decode these? To understand this you need to understand CPU pinning. To achieve maximum performance, we pin each of the service and forwarding threads (the lcores) to specific CPU cores, so each thread is served by dedicated CPUs that are isolated from any other system tasks. So this log reads:
Service threads, that is lcore0 to lcore2 and lcore8 to lcore9 in the message, are all pinned to two CPU cores: CPU core#10 and CPU core#34. This pinning is configured by the SERVICE_CORE_MASK parameter.
Forwarding threads, lcore10 to lcore13, are allocated and pinned to CPU core#2, core#4, core#6, and core#8, respectively. This is configured by the CPU_LIST parameter.
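Decoding such a mapping can be sketched as follows. Note this is a simplified illustration of the NUMBER@NUMBER form described above, not a full parser for the DPDK --lcores syntax, and the sample string is hypothetical, modeled on the log excerpt:

```python
# Decode a simplified lcore@cpu mapping string of the form "L@C,L@C,...".
def parse_lcores(spec):
    """Return {lcore: cpu} from 'L@C,L@C,...' pairs."""
    mapping = {}
    for pair in spec.split(","):
        lcore, cpu = pair.split("@")
        mapping[int(lcore)] = int(cpu)
    return mapping

# Forwarding lcores 10-13 pinned one-to-one to CPU cores 2, 4, 6, 8:
print(parse_lcores("10@2,11@4,12@6,13@8"))
# {10: 2, 11: 4, 12: 6, 13: 8}
```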
Here the logs show MPLSoGRE, but the message actually applies to both MPLSoGRE and VxLAN packets. Historically this hashing only happened for MPLSoGRE traffic, and the wording remains like that in the software code. Here it means both MPLSoGRE and VxLAN packets will be distributed via hashing by the polling core. See Figure 6.4.
Each time a new virtual interface is connected to the vRouter, a vif port is created on the vRouter with the same number of queues as the number of polling CPUs (specified in the CPU_LIST parameter). Each queue is created and handled by only one of the vRouter polling cores, so for each vif we have a one-to-one mapping between vRouter polling cores and RX queues. This mapping can be seen in the dpdkinfo -c command output we introduced earlier. The same can be observed in the DPDK vRouter logs:
2019-09-24 16:36:50,011 VROUTER: Adding vif 8 (gen. 37) virtual device tap66e68bc1-a9
....
2019-09-24 16:36:50,012 VROUTER: lcore 12 RX from HW queue 0
2019-09-24 16:36:50,012 VROUTER: lcore 13 RX from HW queue 1
2019-09-24 16:36:50,012 VROUTER: lcore 10 RX from HW queue 2
2019-09-24 16:36:50,012 VROUTER: lcore 11 RX from HW queue 3
Here, the vif interface 0/8 is created to connect the virtual NIC tap66e68bc1-a9 to the vRouter. Because four forwarding lcores are configured, this vif is created with four queues, namely q0 to q3, which are handled by polling lcores 12, 13, 10, and 11, respectively.
When a polling queue is enabled on the vRouter, a ring activation message is gen-
erated in the Contrail DPDK log file.
The vrings correspond to both transmit and receive queues, as seen in Figure 6.5:
The transmit queues are the even numbers. Divide them by two to get the
queue number: vring 0 is TX queue 0, vring 2 is TX queue 1, … and so on.
The receive queues are the odd numbers. Divide them by two (discard the
remainder) to get the queue number: vring 1 is RX queue 0, vring 3 is RX
queue 1, … and so on.
Ready state 1 = enabled, ready state 0 = disabled.
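These decoding rules can be expressed in a short sketch (illustrative only):

```python
# Decode a vring number per the rules above: even vrings are TX queues,
# odd vrings are RX queues; integer-divide by two for the queue number.
def decode_vring(vring):
    direction = "TX" if vring % 2 == 0 else "RX"
    return direction, vring // 2

print(decode_vring(0))   # ('TX', 0)
print(decode_vring(1))   # ('RX', 0)
print(decode_vring(3))   # ('RX', 1)
```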
In this next example, only one RX (and TX) queue is enabled on the vRouter vif
interface. A single queue virtual machine interface is connected to the vRouter
port:
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 0 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 1 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 2 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 3 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 4 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 5 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 6 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 7 ready state 0
And in the next example, four RX (and TX) queues are enabled on the vRouter vif interface, but a virtual machine interface with more than four queues is connected to the vRouter port:
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 0 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 1 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 2 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 3 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 4 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 5 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 6 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 7 ready state 1
2019-09-24 16:37:46,693 UVHOST: vr_uvhm_set_vring_enable: Can not disable TX queue 4 (only 4 queues)
DPDK vRouter Case Studies
Single Queue
Having understood the lcore mapping basics, let’s test some traffic.
......
Index Source:Port/Destination:Port Proto(V)
-----------------------------------------------------------------------------------
40196<=>436016 192.168.0.106:59514 6 (3)
192.168.0.104:22
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):36, Stats:503/35823,
SPort 56703, TTL 0, Sinfo 8.0.0.3)
436016<=>40196 192.168.0.104:22 6 (3)
192.168.0.106:59514
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):27, Stats:511/71619,
SPort 49812, TTL 0, Sinfo 4.0.0.0)
62792<=>172020 192.168.0.106:48664 6 (3)
192.168.0.104:8474
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):36, Stats:3828/296117,
SPort 63470, TTL 0, Sinfo 8.0.0.3)
172020<=>62792 192.168.0.104:8474 6 (3)
192.168.0.106:48664
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):27, Stats:2739/274615,
SPort 52648, TTL 0, Sinfo 4.0.0.0)
38232<=>257372 192.168.1.105:32768 17 (2)
192.168.1.104:32770
(Gen: 5, K(nh):30, Action:F, Flags:, QOS:-1, S(nh):37, Stats:0/0, SPort 61739,
TTL 0, Sinfo 0.0.0.0)
257372<=>38232 192.168.1.104:32770 17 (2)
192.168.1.105:32768
(Gen: 5, K(nh):30, Action:F, Flags:, QOS:-1, S(nh):30, Stats:390003/48360372,
SPort 62464, TTL 0, Sinfo 3.0.0.0)
Here you can see six vRouter flows, which in fact form three pairs. The first two pairs, with index pairs 40196/436016 and 62792/172020, are generated by the control messages from the rapid jump VM to the PROX gen VM. The last pair, with indexes 38232/257372, is our single-flow test traffic. The stats 390003/48360372 show that the traffic is sent from the gen VM (192.168.1.104:32770) to the swap VM (192.168.1.105:32768).
In Contrail vRouter, flows are generated in pairs. For any traffic, even if it is unidirectional, vRouter generates a reverse flow, because in the real world most traffic is bidirectional and a separate entry for each direction is required. In this case we are generating unidirectional traffic from PROX, so only the flow in that direction has packet stats. The paired flow entry is generated as well, but its packet statistics show nothing.
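The pairing behavior can be sketched as follows. This is an illustrative model only; vRouter's real flow entries of course carry far more state (nexthop, action, flags, and so on):

```python
# For every new flow, a reverse entry with source and destination
# swapped is created as well, even if traffic is unidirectional.
def make_flow_pair(src, sport, dst, dport, proto):
    fwd = (src, sport, dst, dport, proto)
    rev = (dst, dport, src, sport, proto)
    return fwd, rev

fwd, rev = make_flow_pair("192.168.1.104", 32770, "192.168.1.105", 32768, 17)
print(fwd)  # ('192.168.1.104', 32770, '192.168.1.105', 32768, 17)
print(rev)  # ('192.168.1.105', 32768, '192.168.1.104', 32770, 17)
```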
Let’s clear the vif counters and collect the statistics using dpdkvifstats.py tool:
[root@a7s3 ~]# contrail-tools vif --clear
From the first capture, on the vRouter interface connecting to the PROX gen VM tap interface (-v 3), we see that lcore#10 received the traffic: you can tell from the RX speed of 1504 pps showing in Core 1 only. The second capture, on the vRouter interface toward the bond interface (-v 0), confirms the same: it is the same lcore#10 (Core 1 here) that sends the traffic to the bond interface, at 1512 pps, almost the same speed at which it received the traffic from the VM tap interface. The forwarding path is illustrated here:
VM: tap41a9ab05-64 => vif0/3 => lcore#10 => vif0/0 => bond0
This seems weird, doesn't it? Remember, based on the core-interface mapping given by dpdkinfo -c, we already know it was lcore#11 serving our VM interface, not the other one. Accordingly, in the dpdkvifstats.py output that should be Core 2 instead of Core 1. Let's revisit the mapping:
[root@a7s3 ~]# contrail-tools dpdkinfo -c
No. of forwarding lcores: 2
No. of interfaces: 4
Lcore 0:
Interface: bond0.101 Queue ID: 0
Interface: vhost0 Queue ID: 0
Lcore 1:
Interface: bond0.101 Queue ID: 1
Interface: tap41a9ab05-64 Queue ID: 0
So, we were right. The expected flow should be something like this:
VM: tap41a9ab05-64 => vif0/3 => lcore#11 => vif0/0 => bond0
Well, if you remember what you read in Chapter 3, you probably know the answer. When a packet flows from the PROX gen VM to the bond, vRouter uses a pipeline model to process it. What that really means is that the interface's serving lcore, in our case the second forwarding lcore based on the dpdkinfo -c output, polls the packet out of the vif interface. In Chapter 3, when we introduced the vRouter packet forwarding process, we mentioned that when traffic flows from the vif connecting the VM tap interface toward vif0/0, all packets are distributed by the polling lcore to other lcores for processing. The distribution is calculated from a hash of the packet header.
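The idea can be modeled with a small sketch. This is illustrative only: the hash function and header fields here are stand-ins, not the actual vRouter hash:

```python
# Packets polled by one lcore are distributed to processing lcores by
# hashing header fields, so packets of a single flow always land on the
# same lcore while different flows spread across lcores.
import zlib

def pick_lcore(src_ip, dst_ip, sport, dport, proto, lcores):
    key = f"{src_ip}{dst_ip}{sport}{dport}{proto}".encode()
    return lcores[zlib.crc32(key) % len(lcores)]

lcores = [10, 11]
a = pick_lcore("192.168.1.104", "192.168.1.105", 32770, 32768, 17, lcores)
b = pick_lcore("192.168.1.104", "192.168.1.105", 32770, 32768, 17, lcores)
assert a == b   # a single flow always hashes to the same lcore
```

With only two forwarding lcores, every packet the polling lcore hands off can only go to the one other lcore, which is exactly what we observe next.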
Apparently the polling core here, based on the mapping above, is lcore#11, and the only other lcore is the first forwarding lcore, lcore#10. So packets from the VM get polled by lcore#11 and then distributed to lcore#10, which then forwards them to the fabric interface vif0/0. Currently dpdkvifstats.py does not tell us much about these details, but if you collect vif output you'll see additional clues:
[root@a7s3 ~]# contrail-tools vif --get 3 --core 10
Vrouter Interface Table
......
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:2 Mcast Vrf:2 Flags:L3L2DEr QOS:-1 Ref:12
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Core 10 RX packets:31272 bytes:1876320 errors:0
Core 10 TX packets:0 bytes:0 errors:0
Drops:18660668
......
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:2 Mcast Vrf:2 Flags:L3L2DEr QOS:-1 Ref:12
Core 11 RX queue packets:35384 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Core 11 RX packets:26 bytes:1092 errors:0
Core 11 TX packets:24 bytes:1008 errors:0
Drops:18660668
There is an RX queue counter, Core 11 RX queue packets:35384, that gives a very important clue about this inter-core load balancing. Core 11, our second forwarding lcore, first polled the packets from vif0/3 into its RX queue. Instead of processing them, it distributed the packets to the first forwarding lcore, Core 10, which then processed them. That is why the same number of packets is counted as RX packets on Core 10. Therefore, the full forwarding path of this traffic flow is:
VM: tap41a9ab05-64 => vif0/3 => lcore#11 => lcore#10 => vif0/0 => bond0
For the sake of completeness, we also captured the vif command output on the fabric interface vif0/0:
[root@a7s3 ~]# contrail-tools vif --get 0 --core 10
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 11 RX packets:67 bytes:9860 errors:0
Core 11 TX packets:181 bytes:162062 errors:0
Drops:0
Here, after the first forwarding lcore serving vif0/0 processed the packets, it sent them out of vif0/0, which is reflected as TX packets.
One important thing to point out is that what we've tested and demonstrated here is the DPDK vRouter's default behavior with the current parameters. Keep in mind that vRouter is very configurable. One vRouter configuration option introduced in release R2008 changes this default pipeline-model behavior: --vr_no_load_balance. You can verify the command line of the currently running vRouter DPDK process in your setup with the ps command. With that option configured, vRouter switches to the so-called run-to-completion model, in which the same lcore that polled a packet continues to process and forward it. This requires a reboot of the DPDK vRouter; we'll let you test those scenarios in your own lab.
This concludes the analysis of traffic forwarding in the VM-to-fabric direction. Next, let's take a look at the returning direction: from the fabric (vif0/0) to the VM (vif0/3). Here we are looking at the returning traffic from the fabric back to the PROX gen VM.
Focusing on the RX counters of vif0/0 and the TX counters of vif0/3, the data shows that lcore#11 received the packets from vif0/0 and forwarded them out of vif0/3. The forwarding path is illustrated below:
fabric: bond0 => vif0/0 => lcore#11 => vif0/3 => tap41a9ab05-64 => VM
To confirm whether this lcore#11 that is "forwarding" the packets is also the one that did the "polling", we need to look at the vif capture for the "RX queue packets" counter, as we did before:
[root@a7s3 ~]# contrail-tools vif --get 0 --core 10
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
Core 10 RX device packets:3481584 bytes:619708685 errors:0 RX queue errors to lcore
0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 10 RX packets:676 bytes:106243 errors:0
Core 10 TX packets:3482241 bytes:605899226 errors:0
Drops:99
Core 10 TX device packets:3482474 bytes:619966089 errors:0
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 11 RX packets:3594939 bytes:625517508 errors:0
Core 11 TX packets:166 bytes:133391 errors:0
Drops:99
There isn't an "RX queue packets" counter like the one we saw in the VM-to-fabric direction, which indicates that no inter-lcore distribution took place: lcore#11 both polled the packets from the fabric and processed them.
This concludes our analysis of the bidirectional single-flow traffic. As you can see, one benefit of having a traffic generator/swapper built into the lab environment is that you can fine-tune the generator to send traffic in a very specific pattern, take a deep look at the counters, and analyze the vRouter's forwarding behavior. This is very helpful for learning purposes, but in production you can probably never expect such a luxury, since traffic patterns in the field are usually much more complex. But don't worry: you can keep adding complexity to the traffic pattern until it comes close to what you would see in the field.
Multiple Flows
Now we send 64 flows from the PROX gen VM. To confirm the number of flows, let's use the flow -s command in contrail-tools:
[root@a7s3 ~]# contrail-tools flow -s
Flow Statistics
---------------
Total Entries --- Total = 132, new = 0
Active Entries --- Total = 132, new = 0
Hold Entries --- Total = 0, new = 0
Fwd flow Entries - Total = 132
drop flow Entries - Total = 0
NAT flow Entries - Total = 0
You can see 132 flow entries, that is, 66 flow pairs: 64 pairs for our test flows plus the two pairs of control flows between the jump VM and the gen VM. Good, let's collect the traffic statistics:
[root@a7s3 ~]# contrail-tools vif --clear
With a single queue, the mapping between the tap interface and the lcores never changes. In this case it is always lcore#11 polling the traffic and distributing it to lcore#10, so we will always see packets forwarded by lcore#10 rather than lcore#11, regardless of the number of flows or the traffic volume. In the other direction, when we enable the returning traffic, we see on vif0/0 that the two lcores' traffic is RX pps: 41547 and RX pps: 44257, which is well balanced because we have two queues enabled on vif0/0:
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -all -c 2
| VIF 3 | Core 1 | TX pps: 41249 | RX pps: 85182 | TX bps: 40919336 | RX bps: 84500544
| VIF 3 | Core 2 | TX pps: 43936 | RX pps: 1 | TX bps: 43584072 | RX bps: 336
| VIF 0 | Core 1 | TX pps: 85765 | RX pps: 41547 | TX bps: 119382912 | RX bps: 57825008
| VIF 0 | Core 2 | TX pps: 3 | RX pps: 44257 | TX bps: 18216 | RX bps: 61604304
------------------------------------------------------------------------
| pps per Core |
------------------------------------------------------------------------
|Core 1 |TX + RX pps: 253763 | TX pps 127025 | RX pps 126738 |
|Core 2 |TX + RX pps: 88197 | TX pps 43939 | RX pps 44258 |
------------------------------------------------------------------------
|Total |TX + RX pps: 341960 | TX pps 170964 | RX pps 170996 |
------------------------------------------------------------------------
With a single queue on the VM tap interface, it is hard to achieve good load balancing between lcores on the vRouter interface facing the VM. Sometimes you need to enable multiple queues to make better use of all your DPDK forwarding lcores. This concludes our analysis of the single-queue test; next we'll test multiple queues.
Multiple Queues
Finally, let's look at a multiple-queue example. Based on the previous setup, this time we add one more queue to the tap interface of the gen VM and then collect the core-interface mapping (see Figure 6.7):
[root@a7s3 ~]# contrail-tools dpdkinfo -c
No. of forwarding lcores: 2
No. of interfaces: 5
Lcore 0:
Interface: bond0.101 Queue ID: 0
Interface: vhost0 Queue ID: 0
Interface: tap41a9ab05-64 Queue ID: 1
Lcore 1:
Interface: bond0.101 Queue ID: 1
Interface: tap41a9ab05-64 Queue ID: 0
Most items remain the same, except that one more queue has been added on the tap interface and on the vRouter interface to which it attaches. Correspondingly, one more lcore is allocated to serve this new queue. Before this new queue was created, we already knew that each of our lcores was serving the same number of queues; therefore, as a tie breaker, which we mentioned when we introduced dpdkinfo -c, the first forwarding lcore, lcore#10 in our notation, is allocated to the new queue.
Let’s check the traffic distribution between lcores with multiple queues on VM tap
interface:
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -all -c 2
| VIF 3 | Core 1 | TX pps: 41319 | RX pps: 42606 | TX bps: 40988672 | RX bps: 42264712
| VIF 3 | Core 2 | TX pps: 43889 | RX pps: 42604 | TX bps: 43537008 | RX bps: 42262288
| VIF 0 | Core 1 | TX pps: 42923 | RX pps: 41540 | TX bps: 59748824 | RX bps: 57815160
| VIF 0 | Core 2 | TX pps: 42918 | RX pps: 44320 | TX bps: 59741640 | RX bps: 61693328
------------------------------------------------------------------------
| pps per Core |
------------------------------------------------------------------------
|Core 1 |TX + RX pps: 168416 | TX pps 84258 | RX pps 84158 |
|Core 2 |TX + RX pps: 173731 | TX pps 86807 | RX pps 86924 |
------------------------------------------------------------------------
|Total |TX + RX pps: 342147 | TX pps 171065 | RX pps 171082 |
------------------------------------------------------------------------
Now we have multiple queues on both the VM tap interface and the fabric interface, and the traffic on all lcores is well balanced. Keep this in mind as the ideal traffic profile we expect the vRouter to have. In production we usually deal with more complicated vRouter lcore configurations and traffic profiles, so lcore balancing may not look as perfect as what we see in the lab environment, but at least you have a good baseline in mind and know what to look for when the result is far worse than expected.