
The IP lookup mechanism in a Linux software router: Performance evaluation and optimizations

2007 IEEE Workshop on High Performance Switching and Routing, HPSR, 2007
The IP Lookup Mechanism in a Linux Software Router: Performance Evaluation and Optimizations

Raffaele Bolla, Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science, University of Genoa
Via Opera Pia 13, 16145 Genova, Italy
{raffaele.bolla, roberto.bruschi}@unige.it

Abstract—In recent years, networking device architectures based on Open Source software, like Linux, have aroused lively interest in both the scientific and industrial communities. The key to this success can probably be found in the flexibility and fast development time of the software approach, and in the reliability level guaranteed by large communities of users and developers. In this context, our aim is to customize, analyze and optimize a Linux-based architecture for exclusive networking use. In particular, the objective of this work is to study, evaluate the performance of, and optimize the IP lookup mechanism included in the Linux kernel. To this end, we have released a set of kernel patches that enhance system performance by optimizing the IP lookup mechanism. We present a complete set of benchmarking results based on both internal and external measurements.

Keywords-Open Router; IP lookup; Linux Router.

I. INTRODUCTION

In recent years, network equipment architectures based on Open Source software have received great interest from the scientific and industrial communities. Some well-known projects (e.g., the Linux kernel itself, the Click Modular Router [1], Zebra [2] and Xorp [3]) have been successfully carried out with the main aim of developing a complete IPv4 routing platform based on Open Source Software (SW) and COTS Hardware (HW). In the following, we refer to this platform as the Open Router (OR).

Some interesting works regarding ORs can be found in the scientific literature. Among others, in [4], [5] and [6], Bianco et al. report a detailed performance evaluation; [7] and [8] propose a router architecture based on a PC cluster, while [9] reports some performance results (in packet transmission and reception) obtained with a Linux-based PC testbed. In our previous works, carried out within the BORABORA project [10], [11], [12], we focused our attention on testing, optimizing and evaluating the performance of the basic packet forwarding functionalities. To this purpose, besides classical external measurements (throughput and latency), we also adopted profiling tools to analyze in depth and to optimize the internal behaviour of a Linux-based OR. In [13], we proposed a survey on how the presence of control plane functionalities in a Software Router can impact the forwarding performance.

The previous work on the Linux Software Router shows that large and interesting areas still require deeper investigation: the identification of the most appropriate HW structures, the comparison of different SW solutions, the identification of the best SW configurations with an indication of the most significant parameters, the accurate characterization of the open node performance, the identification of SW and HW bottlenecks, and the effectiveness and influence of advanced functionalities (e.g., firewalling, QoS, etc.) are only some of the open issues in this field of research. Starting from an optimized Linux OR architecture, the objective of this work is to focus on the IP lookup mechanism included in last generation Linux kernels.
With this aim, we present a detailed analysis of the Linux IP lookup mechanism, its performance evaluation, and some kernel optimizations that enhance the whole system behaviour. In particular, we focus both on the RT cache performance and on the comparison of the two IP lookup mechanisms included in the Linux kernel: the Radix tree and the LC-trie. An interesting and detailed analysis of the LC-trie mechanism is reported in [14], where its Linux implementation is taken into account and tested; however, unlike the performance evaluation reported in this work, the IP lookup is tested there as a standalone module (i.e., without the complete Linux forwarding chain).

The paper is organized as follows. Section II describes the SW OR architecture, while Section III reports a detailed description of the proposed kernel patches/enhancements. Sections IV and V present the benchmarking scenario and the performance results, respectively. The conclusions are drawn in Section VI.

II. THE LINUX OR ARCHITECTURE

As outlined in [12], while all the forwarding functions are realized inside the Linux kernel, most of the control and monitoring operations (the signalling protocols, such as routing and control protocols) are daemons/applications running in user mode. Thus, we have to point out that, unlike in most high-end commercial network equipment, the forwarding functionalities and the control ones have to share the CPUs of the system. [13] reports a detailed description of how the resource sharing between the control plane and the forwarding process can affect the overall performance in different OR configurations (e.g., SMP kernel, single processor kernel, etc.).

The critical element for IP forwarding is the kernel, where all the link, network and transport layer operations are realized. During the last years, the networking support integrated in the Linux kernel has undergone many structural and refining developments. For these reasons, we have chosen to use a last generation Linux kernel, specifically a 2.6 version. Since our main aim is to study and to optimize the OR performance focusing on the IP lookup mechanism, the following two sub-sections give an overview of the whole Linux forwarding (Data Plane) architecture and a description of the IP lookup mechanism, respectively.

A. The Linux Data Plane Architecture

Since the older kernel versions, the Linux networking architecture has fundamentally been based on an interrupt mechanism: network boards signal the kernel upon packet reception or transmission through HW interrupts. Each HW interrupt is served as soon as possible by a handling routine, which suspends the operations currently being processed by the CPU. Until it has completed, the handler cannot be interrupted by anything, not even by other interrupt handlers. To keep the system reactive, the interrupt handlers are designed to be very short, while all the time-consuming tasks are deferred to the so-called Software Interrupts (SoftIRQs). This is the well-known "top half - bottom half" IRQ routine division implemented in the Linux kernel [15]. SoftIRQs are in fact a form of kernel activity that can be scheduled for later execution, rather than real interrupts.
They differ from HW IRQs mainly in that a SoftIRQ is scheduled for execution by an activity of the kernel, such as a HW IRQ routine, and has to wait until it is called by the scheduler. SoftIRQs can be interrupted only by HW IRQ routines. NET_TX_SOFTIRQ and NET_RX_SOFTIRQ are two of the most important SoftIRQs in the Linux kernel and the backbone of the whole networking architecture, since they manage the packet transmission and reception operations, respectively.

In detail, the forwarding process is triggered by a HW IRQ generated by a network device, which signals the reception or the transmission of packets. The corresponding routine then performs some fast checks and schedules the proper SoftIRQ, which is activated by the kernel scheduler as soon as possible. When the SoftIRQ is finally executed, it performs all the packet forwarding operations. The packet forwarding process is fundamentally composed of a chain of three modules: a "reception API" that handles packet reception (NAPI), a module that carries out the IP layer processing and, finally, a "transmission API" that manages the forwarding operations towards the egress network interfaces. The reception and transmission APIs are the lowest level modules; they are composed of both HW IRQ routines and SoftIRQs, and they work by managing the network interfaces and performing some layer 2 functionalities.

Analyzing the modules of the forwarding chain in more detail, NAPI [16] was introduced in the 2.4.27 kernel version, and it was explicitly created to increase the scalability of the reception process. It handles network interface requests with an interrupt moderation mechanism, which allows the kernel to switch adaptively from a classical interrupt-driven management of the network interfaces to a polling one. In particular, this is done by having the HW IRQ routine insert the identifier of the board that generated the IRQ into a special list, called the "poll list", schedule a reception SoftIRQ, and disable the HW IRQs for that device. When the SoftIRQ is activated, the kernel polls all the devices whose identifiers are included in the poll list, and a maximum of "quota" packets are served per device. If the board buffer (RxRing) is emptied, the identifier is removed from the poll list and its HW IRQs are re-enabled; otherwise its HW IRQs are left disabled, the identifier is kept in the poll list and a further SoftIRQ is scheduled. While this mechanism behaves like a pure interrupt mechanism at low ingress rates (i.e., there is roughly one HW IRQ per packet), when the traffic rises the probability of emptying the RxRing, and hence of re-enabling the HW IRQs, decreases more and more, and NAPI starts working like a polling mechanism.

For each packet received during the NAPI processing, a descriptor, called skbuff, is immediately allocated and used for all the layer 2 and 3 operations. A packet is processed in the same NET_RX SoftIRQ until it is enqueued in an egress device buffer, called the Qdisc. Each time a NET_TX SoftIRQ is activated or a new packet is enqueued, the Qdisc buffer is served. When a packet is dequeued from the Qdisc buffer, it is placed on the Tx Ring of the egress device. After the board has transmitted one or more packets successfully, it generates a HW IRQ, whose routine schedules a NET_TX SoftIRQ. During a NET_TX SoftIRQ, the Tx Ring is cleaned of all the descriptors of transmitted packets, which are then de-allocated, and it is refilled with packets coming from the Qdisc buffer.
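To make the interplay among HW IRQs, the poll list and the quota limit described above more concrete, the following simplified user-space C sketch mimics the NAPI reception logic. It is only an illustrative model under our own naming (fake_dev, hw_irq, net_rx_softirq, QUOTA); locking, skbuff allocation, IP processing and the actual SoftIRQ scheduling are omitted, and the real code lives in the kernel core and in the NIC drivers.

/*
 * Simplified, user-space model of the NAPI reception loop described in
 * Section II.A. All names (fake_dev, hw_irq, ...) are illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define QUOTA 64              /* max packets served per device per poll round */

struct fake_dev {
    const char *name;
    int rx_ring_len;          /* packets currently waiting in the RxRing */
    bool irq_enabled;         /* HW IRQ state of this board              */
    bool on_poll_list;
};

/* HW IRQ handler ("top half"): very short, it only defers work. */
static void hw_irq(struct fake_dev *d)
{
    d->irq_enabled = false;   /* mask further IRQs from this board   */
    d->on_poll_list = true;   /* enqueue the device on the poll list */
    /* ...a NET_RX SoftIRQ would be scheduled here. */
}

/* NET_RX SoftIRQ ("bottom half"): poll each listed device, at most QUOTA packets. */
static void net_rx_softirq(struct fake_dev *devs[], int ndev)
{
    for (int i = 0; i < ndev; i++) {
        struct fake_dev *d = devs[i];
        if (!d->on_poll_list)
            continue;

        int served = 0;
        while (d->rx_ring_len > 0 && served < QUOTA) {
            d->rx_ring_len--;              /* "process" one packet: allocate the  */
            served++;                      /* skbuff, IP processing, Qdisc enqueue */
        }

        if (d->rx_ring_len == 0) {         /* ring emptied: back to IRQ mode */
            d->on_poll_list = false;
            d->irq_enabled = true;
        }
        /* Otherwise the device stays on the poll list and another SoftIRQ
         * round is scheduled: polling mode under high load. */
        printf("%s: served %d, %d left, irq %s\n", d->name, served,
               d->rx_ring_len, d->irq_enabled ? "on" : "off");
    }
}

int main(void)
{
    struct fake_dev eth0 = { "eth0", 150, true, false };
    struct fake_dev *devs[] = { &eth0 };

    hw_irq(&eth0);                         /* a burst of packets arrives        */
    net_rx_softirq(devs, 1);               /* first round serves QUOTA packets  */
    net_rx_softirq(devs, 1);               /* further rounds empty the ring     */
    net_rx_softirq(devs, 1);
    return 0;
}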
B. The IP lookup mechanism

The IP lookup mechanism included in the Linux kernel is quite complex and complete. Since the 2.1 version, Linux has supported an advanced routing mechanism that is substantially based on three elements: multiple Routing Tables (RTs), a Routing Policy Database (RPDB) and a Routing Cache, where the most recently used RT entries are stored. While the cache is essential to speed up the lookup performance, the presence of multiple RTs and of a RPDB increases the flexibility of the Linux forwarding functionalities. A flow diagram of the lookup mechanism is reported in Fig. 1.

Analyzing the whole mechanism in detail, when determining the route by which to send a packet, the kernel always consults the routing cache first. The routing cache is substantially a hash table used for quick access to recently used routes. If the right entry is found in the routing cache, it is used to forward the current packet; otherwise, to determine the right route, the kernel invokes the RPDB. The RPDB is the core element of the IP lookup, and can be defined as a cohesive set of routing tables and rules. Thus, the main functionalities provided by the RPDB are the mechanism implementing the rule element of Policy Routing, and the capability to address multiple routing tables. In particular, the RPDB supports a maximum of 255 routing tables and 2^32 rules, which potentially corresponds to one rule for each IPv4 address. In this context, a rule may be considered as the filter or selection agent for applying Policy Routing and for choosing the suitable RT for each incoming packet. Rules point to routing tables: several rules may refer to one routing table, and some routing tables may have no rules pointing to them. The RPDB can act on the basis of different parameters, such as the packet source address, the ToS flags, the so-called "fwmark" (i.e., the packet filtering tag used by netfilter [17] for matching IP protocols and transport ports), and the reception interface name.

For each matching rule in the RPDB, the kernel tries to find a matching route to the destination IP address in the selected RT using a longest prefix match algorithm. For this purpose, since kernel 2.6.13, Linux supports two search algorithms: the Radix search tree [18] and the LC-trie [19]. When a matching destination is found, the kernel selects the matching route and forwards the packet. If no matching entry is found in the selected routing table, the kernel passes to the next rule in the RPDB, until it finds a match or falls through the end of the RPDB once all the routing tables have been consulted.

Figure 1. The IP lookup mechanism included in the Linux kernel (cache lookup on the incoming packet; on a miss, RPDB rule evaluation, RT selection and RT lookup, then insertion of a new cache entry; a garbage collector trashes the oldest cache entries).
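To summarize the flow of Fig. 1 in code form, the sketch below models the cache-first lookup with fallback to a longest prefix match in the routing table selected by the RPDB, followed by the insertion of the result in the cache. It is a deliberately simplified user-space model: the hash function, the data structures and the helper names (hash_flow, fib_longest_match, ip_route) are ours and do not reproduce the kernel internals; rule evaluation across multiple RTs and garbage collection are left out.

/*
 * Minimal user-space model of the lookup flow in Fig. 1: consult the route
 * cache first, on a miss fall back to a longest-prefix-match search in the
 * selected routing table, then insert the result in the cache.
 */
#include <stdint.h>
#include <stdio.h>

#define CACHE_BUCKETS 256

struct route   { uint32_t prefix; int plen; int out_if; };
struct cache_e { uint32_t src, dst; uint8_t tos; int out_if; int valid; };

static struct cache_e cache[CACHE_BUCKETS];

/* Per-flow hash key, computed (as in the stock kernel cache) from the
 * source address, the destination address and the ToS. Toy mixing only. */
static unsigned hash_flow(uint32_t src, uint32_t dst, uint8_t tos)
{
    return (src ^ dst ^ tos) % CACHE_BUCKETS;
}

/* Longest-prefix match over one routing table (linear scan for brevity;
 * the kernel uses a Radix tree or an LC-trie here). */
static int fib_longest_match(const struct route *rt, int n, uint32_t dst)
{
    int best = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = rt[i].plen ? ~0u << (32 - rt[i].plen) : 0;
        if ((dst & mask) == rt[i].prefix && rt[i].plen > best_len) {
            best = rt[i].out_if;
            best_len = rt[i].plen;
        }
    }
    return best;
}

static int ip_route(const struct route *rt, int n,
                    uint32_t src, uint32_t dst, uint8_t tos)
{
    struct cache_e *e = &cache[hash_flow(src, dst, tos)];
    if (e->valid && e->src == src && e->dst == dst && e->tos == tos)
        return e->out_if;                       /* cache hit             */

    int out_if = fib_longest_match(rt, n, dst); /* cache miss: RT lookup */
    *e = (struct cache_e){ src, dst, tos, out_if, 1 };
    return out_if;
}

int main(void)
{
    struct route rt[] = {
        { 0x0A000000, 8,  1 },   /* 10.0.0.0/8   -> if 1 */
        { 0x0A0A0000, 16, 2 },   /* 10.10.0.0/16 -> if 2 */
    };
    uint32_t dst = 0x0A0A0101;   /* 10.10.1.1 */
    printf("first lookup  -> if %d\n", ip_route(rt, 2, 0xC0A80001, dst, 0));
    printf("second lookup -> if %d (served by the cache)\n",
           ip_route(rt, 2, 0xC0A80001, dst, 0));
    return 0;
}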
III. SYSTEM OPTIMIZATION AND TUNING

The whole networking kernel architecture is quite complex, and it has many aspects and parameters that should be refined or tuned to optimize the system. As far as parameter tuning is concerned, we have used the same configuration adopted in [11]. The 2.6.16.13 kernel images used to obtain the numerical results in Section V include four structural patches that we have produced with the aim of testing and/or optimizing the kernel functionalities. The patches are described in the following.

A. Skbuff Recycling patch

In [20], we studied and realized a new version of the skbuff Recycling patch, originally proposed by R. Olsson [21] for the "e1000" driver. In particular, the new version is stabilized for the 2.6.16.13 kernel version and extended to the "sundance" driver. This patch intercepts the skbuff descriptors of sent packets before their deallocation and re-uses them for new incoming packets. As shown in [11], this architectural change heavily reduces the computational weight of the memory management operations, and it allows a very high performance level to be achieved (i.e., about 150-175% of the maximum throughput of standard kernels).

B. Destination based RT Cache patch

As previously sketched, the Routing Cache is composed of a hash table where the most recently used routes are stored. The hash key, computed for each incoming packet, does not univocally match a route, since it is a 32 bit value computed from the IP source and destination addresses and the ToS flags. Thus, once the hash key is found, it potentially points to a route list, which is linearly scanned until all the selectors (i.e., source and destination IP addresses, ToS, fwmark, incoming device, etc.) agree. When Policy Routing is not active and routing decisions are taken only on the basis of the IP destination address, this kind of cache structure leads to heavy performance limits. For example, as shown in Section V, when the number of source IP addresses of incoming packets increases, the memory needed by the cache grows considerably, since a new cache entry is needed at least for each source IP address value. Starting from these considerations, we have decided to propose a new cache architecture to be used without Policy Routing: in this patch the hash key is obtained from the destination address only, and only the "fwmark" is used as route selector (a sketch of this keying change is given at the end of this Section).

C. RT Cache disabled patch

To better evaluate the performance level of the two RT lookup algorithms included in the Linux kernel (i.e., the Radix tree and the LC-trie), we have used a patch that disables the RT cache. In this way, the routes of incoming packets are always computed by these algorithms.

D. Performance Counter patch

To study the OR internal behaviour in depth, we have introduced a set of counters in the kernel source code, with the aim of understanding how many times a certain procedure is called, or how many packets are handled each time. In detail, we have introduced the following counters: IRQ: number of interrupt handlers generated by a network card; tx/rx IRQ: number of tx/rx IRQ routines per device; tx/rx SoftIRQ: number of tx/rx software IRQ routines; Qdiscrun and Qdiscpkt: number of times the output buffer (Qdisc) is served, and number of packets served each time; Pollrun and Pollpkt: number of times the rx ring of a device is served, and number of packets served each time; tx/rx clean: number of times the tx/rx cleaning procedures of the driver are activated.
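With reference to the destination based RT cache patch of Section III.B, the following fragment contrasts the stock per-flow key with the destination-only key. Only the inputs of the key are the point being illustrated: the mixing function is a toy stand-in chosen by us, not the kernel's own jhash-based routine, and the fwmark selector is omitted.

/*
 * Illustration of the keying change introduced by the destination-based
 * RT cache patch (Section III.B). The hash routine is a stand-in.
 */
#include <stdint.h>
#include <stdio.h>

#define CACHE_BUCKETS 1024u

static unsigned mix(uint32_t a, uint32_t b, uint32_t c)
{
    /* toy mixing function, not the kernel's jhash */
    return ((a * 2654435761u) ^ (b * 40503u) ^ c) % CACHE_BUCKETS;
}

/* Stock cache: one key (and hence, potentially, one cache entry) per
 * (source address, destination address, ToS) combination. */
static unsigned key_per_flow(uint32_t saddr, uint32_t daddr, uint8_t tos)
{
    return mix(saddr, daddr, tos);
}

/* Patched cache: the key depends on the destination only, so traffic from
 * many sources towards one destination can share a single entry. */
static unsigned key_per_destination(uint32_t daddr)
{
    return mix(daddr, 0, 0);
}

int main(void)
{
    uint32_t dst = 0x0A0A0101;                 /* 10.10.1.1 */
    for (uint32_t src = 1; src <= 3; src++)
        printf("src %u: per-flow key %u, per-destination key %u\n",
               src, key_per_flow(src, dst, 0), key_per_destination(dst));
    return 0;
}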
IV. BENCHMARKING TOOLS

To benchmark the OR forwarding performance, we have used professional equipment, namely the Agilent N2X Router Tester [22], which provides throughput and latency measurements with very high availability and accuracy (i.e., the minimum guaranteed timestamp resolution is 10 ns). To better support the performance analysis and to identify the OR bottlenecks, we have also performed some internal measurements by using specific SW tools (called profilers) placed inside the OR, which are able to trace the percentage of CPU utilization of each SW module running on the node. The problem is that many of these profilers require a relevant computational effort that perturbs the system performance. We have experimentally verified that one of the best is Oprofile [23], an open source tool that continuously monitors the system dynamics with frequent and quite regular sampling of the CPU HW registers. Oprofile allows an effective evaluation of the CPU utilization of both each SW application and each single kernel function, with a very low computational overhead.

V. NUMERICAL RESULTS

In this Section, we report some of the numerical results obtained to evaluate the performance of the IP lookup mechanism. We propose two different benchmarking test sessions: one for evaluating RT cache scalability and performance, the other to characterize the performance of, and to compare, the Radix and the LC-trie lookup mechanisms. For each benchmarking session, we show three different kinds of results: external measurements (i.e., throughput and average latency), profiling statistics and performance counters (i.e., the kernel patch in Section III.D). While the external measurements are the classical performance indexes used to benchmark network devices, both the profiling and the internal counters allow us to evaluate the OR internal dynamics and features, and to characterize its SW architecture bottlenecks. As far as the profiling results are concerned, we have classified the kernel functions in different logical sets, which correspond to the main high level kernel functionalities: idle, kernel scheduler, memory management, IP processing, NAPI, Tx API, HW IRQ routines, Ethernet processing and Oprofile functions. This classification has been slightly changed with respect to our other works on the OR.

In all the benchmarking sessions reported here, we have used a traffic matrix composed of a single flow of 64 byte datagrams that crosses the OR between two different Gigabit Ethernet adapters. We have chosen small datagrams since the computational capacity is the actual performance bottleneck of the OR, as shown in [11]. As far as the HW architecture is concerned, we have used a system based on a Supermicro X5DL8-GG mainboard, 2.4 GHz Intel Xeon processors and Intel PRO/1000 XT Server network adapters.
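As a reference point for the offered load percentages used below, and assuming that the 64 byte size refers to the whole Ethernet frame (the usual RFC 2544 convention, as in [11]), the theoretical Gigabit Ethernet line rate at this frame size follows from the standard framing overhead (8 bytes of preamble and start-of-frame delimiter plus a 12 byte minimum inter-frame gap per frame):

\[
R_{\max} \;=\; \frac{10^{9}\ \mathrm{bit/s}}{(64+8+12)\ \mathrm{byte}\times 8\ \mathrm{bit/byte}} \;=\; \frac{10^{9}}{672}\ \mathrm{s^{-1}} \;\approx\; 1.488\ \mathrm{Mpkt/s}.
\]

If, instead, 64 bytes refers to the IP datagram alone, the frame is correspondingly larger and the maximum rate somewhat lower.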
A. Benchmarking tests on RT Cache

The RT cache has a per-flow structure, since the 32 bit hash key is built from the source and destination IP addresses and the ToS value. While the per-flow cache structure is clearly needed when advanced routing policies are used, it reduces the scalability level when routing decisions are made only on the basis of the destination IP address. To this purpose, Figs. 2 and 3 show the throughput and latency values obtained with an optimized kernel (including the skbuff recycling patch) in the presence of a traffic flow composed of datagrams with different source IP addresses. Observing these figures, we can note how the maximum throughput value decreases as the number of source IP addresses rises, in accordance with the linear increase of the cache size shown in Fig. 4. Moreover, as we can observe in Fig. 3, when the OR reaches saturation (over 50% of offered load), the average delay increases with the number of source IPs: this is clearly due to the increase of the search times in the RT cache.

Figs. 5 and 6 show the profiling results for the 1 and 16384 source IP address tests (with reference to Fig. 2): in this case we can see how, coherently with expectations, the IP processing functions double their overall computational weight. The behaviour of the other functionalities is clearly bound to the number of forwarded packets: the weight of almost all the classes rises linearly up to the saturation point, and afterwards it remains more or less constant. On the contrary, the IRQ handlers show a particular behaviour, caused by the NAPI paradigm: when the traffic load is low their computational weight is high, since the Rx API works like an interrupt mechanism, while for higher loads it decreases more and more. When the IRQ weight becomes zero, the OR reaches the saturation point and works like a polling mechanism. This is confirmed also by the performance counters reported in Figs. 7 and 8, where we can see both the tx and rx boards reducing their IRQ generation rates, while the kernel passes from polling the rx ring twice per received packet to about 0.22 times. Also the number of Rx SoftIRQs per received packet decreases as the offered traffic load rises.

Figure 2. Throughput values of the kernel with skbuff recycling patch and a variable number of IP source addresses.
Figure 3. Average latency values of the kernel with skbuff recycling patch and a variable number of IP source addresses.
Figure 4. Number of total objects and active objects contained in the RT cache for the kernel with skbuff recycling patch; the offered load is 1 Gbps.
Figure 5. Profiling results of the skbuff recycling patched kernel, in the presence of an incoming traffic flow with a single IP source address.
Figure 6. Profiling results of the skbuff recycling patched kernel, in the presence of an incoming traffic flow with 16384 IP source address values.
Figure 7. Number of IRQ routines, of polls and of Rx SoftIRQs (second y-axis) for the RX board, for the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only 1 IP source address.
As far as the transmission dynamics are concerned, Fig. 8 shows very low function occurrences: the Tx IRQ routines reduce their occurrence rate up to saturation, while the "wake" function, which represents the number of times the Tx Ring is cleaned and the Qdisc buffer is served during a Rx SoftIRQ, shows a mirror behaviour. This is because, when the OR reaches saturation, all the tx functionalities are activated when the Rx SoftIRQ starts.

Fig. 9 reports the throughput values obtained with the cache patch described in Section III.B: as can be observed, with a destination based cache and no advanced routing policies, there is no performance decay for the different source IP address ranges. In fact, in all the performed tests the number of RT cache entries turned out to be low and constant, thus guaranteeing very fast cache lookup times (all the obtained results look very similar to the standard case with only 1 source IP address).

Figure 8. Number of IRQ routines for the TX board, of Tx Ring cleanings by TxSoftIRQ ("func") and by RxSoftIRQ ("wake") (second y-axis) for the skbuff recycling patched kernel, in the same conditions as Fig. 7.
Figure 9. Throughput values of the kernel with skbuff recycling and cache patches, with a variable number of IP source addresses.
Figure 10. Profiling results of the skbuff recycling and cache patched kernel, in the presence of an incoming flow with 16384 IP source addresses.

B. IP lookup benchmarking

In this second test session, we show a comparison between the two IP lookup mechanisms included in the Linux kernel: the Radix tree and the LC-trie. To this purpose, in all the tests reported in this Section, we use a traffic flow composed of a variable range of random IP destination addresses that hit the RT uniformly. We first performed a test using a skbuff recycling patched kernel, with the aim of analyzing the OR scalability when the number of IP destination addresses rises. As in the case of Fig. 2, a rising number of IP addresses causes an enlargement of the RT cache, and a consequent increase in search times. This is clearly observable in Fig. 12, where the obtained throughput values show a behaviour very similar to that of Fig. 2: the performance decay is not caused by the IP lookup mechanism itself (since it is activated only for the first datagram with a new IP destination address, after which a new cache entry is allocated and used), but it is substantially due to the increase of the search time in the cache hash table in the presence of 256, 1024, 8192 and 16384 entries, respectively. Therefore, to better analyze and compare the performance of the Radix and LC-trie IP lookup mechanisms, we have decided to disable the cache with the kernel patch introduced in Section III.C. In this environment, we have repeated the previous test for both mechanisms, and Figs. 11, 12, 13 and 14 show the external measurement results. The Radix mechanism appears to have a lower computational complexity with respect to the LC-trie, since it allows higher throughputs (the difference is more or less 30-50 kpps) and lower packet latencies. The profiling results, shown in Figs. 15, 16, 17 and 18, confirm this trend, as the IP functions have a lower CPU utilization for the Radix kernel. However, this performance difference seems to shrink as the number of RT entries rises.
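Both algorithms compared here perform a longest prefix match over a trie built on the destination prefixes. As a readable stand-in, the sketch below implements the idea on a plain one-bit-per-level binary trie; the kernel's Radix tree compresses one-way chains and the LC-trie additionally replaces densely populated subtrees with multi-bit branches, but the matching logic is the same. The structure and the names are ours, for illustration only.

/*
 * Plain binary trie (one bit per level) doing longest-prefix match, as a
 * readable stand-in for the Radix and LC-trie searches compared above.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct tnode {
    struct tnode *child[2];
    int next_hop;                 /* -1 if no prefix ends at this node */
};

static struct tnode *node_new(void)
{
    struct tnode *n = calloc(1, sizeof(*n));
    n->next_hop = -1;
    return n;
}

static void trie_insert(struct tnode *root, uint32_t prefix, int plen, int hop)
{
    struct tnode *n = root;
    for (int i = 0; i < plen; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (!n->child[bit])
            n->child[bit] = node_new();
        n = n->child[bit];
    }
    n->next_hop = hop;
}

/* Walk the destination address bit by bit, remembering the last next hop
 * seen: by construction, that is the longest matching prefix. */
static int trie_lookup(const struct tnode *root, uint32_t dst)
{
    int best = -1;
    const struct tnode *n = root;
    for (int i = 0; i < 32 && n; i++) {
        if (n->next_hop >= 0)
            best = n->next_hop;
        n = n->child[(dst >> (31 - i)) & 1];
    }
    if (n && n->next_hop >= 0)
        best = n->next_hop;
    return best;
}

int main(void)
{
    struct tnode *root = node_new();
    trie_insert(root, 0x0A000000, 8, 1);    /* 10.0.0.0/8   -> hop 1 */
    trie_insert(root, 0x0A0A0000, 16, 2);   /* 10.10.0.0/16 -> hop 2 */
    printf("10.10.1.1 -> hop %d\n", trie_lookup(root, 0x0A0A0101));  /* 2 */
    printf("10.1.1.1  -> hop %d\n", trie_lookup(root, 0x0A010101));  /* 1 */
    return 0;
}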
Figure 11. Throughput values of the kernel with skbuff recycling and cache disabling patches, LC-trie lookup and with RT with variable size.
Figure 12. Throughput values of the kernel with skbuff recycling patch, Radix lookup and with RT with variable size.
Figure 13. Throughput values of the kernel with skbuff recycling and cache patches, Radix lookup and with variable sized RT.
Figure 14. Latency values of the kernel with recycling and cache patches, LC-trie lookup and with RT with variable size.
Figure 15. Latency values of the kernel with recycling and cache patches, Radix lookup and with RT with variable size.
Figure 16. Profiling results of the skbuff recycling and cache disabling patched kernel with Radix lookup, in the presence of 256 RT entries.
Figure 17. CPU utilization of the most significant kernel functionalities involved in the forwarding process for the skbuff recycling and cache disabling patched kernel with Radix lookup, in the presence of 16384 RT entries.
Figure 18. CPU utilization of the most significant kernel functionalities involved in the forwarding process for the skbuff recycling and cache disabling patched kernel with LC-trie lookup, in the presence of 256 RT entries.
Figure 19. CPU utilization of the most significant kernel functionalities involved in the forwarding process for the skbuff recycling and cache disabling patched kernel with LC-trie lookup, in the presence of 16384 RT entries.
VI. CONCLUSIONS

In this work, we have presented a complete study of the IP lookup mechanism included in the Linux kernel: we have shown a detailed performance evaluation, using both internal and external measurements, and some kernel patches that optimize its behaviour. In particular, this work offers two main contributions: the scalability of the per-flow RT cache has been tested and compared with a destination based structure, and the performance of the Radix and LC-trie lookup algorithms has been compared and discussed.

REFERENCES

[1] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, "The Click modular router", ACM Transactions on Computer Systems, vol. 18, no. 3, Aug. 2000, pp. 263-297.
[2] Zebra, http://www.zebra.org/.
[3] M. Handley, O. Hodson, E. Kohler, "XORP: an open platform for network research", ACM SIGCOMM Computer Communication Review, vol. 33, no. 1, Jan. 2003, pp. 53-57.
[4] A. Bianco, R. Birke, D. Bolognesi, J. M. Finochietto, G. Galante, M. Mellia, M.L.N.P.P. Prashant, F. Neri, "Click vs. Linux: Two Efficient Open-Source IP Network Stacks for Software Routers", Proc. of IEEE 2005 Workshop on High Performance Switching and Routing (HPSR 2005), Hong Kong, May 2005, pp. 18-23.
[5] A. Bianco, J. M. Finochietto, G. Galante, M. Mellia, F. Neri, "Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching", Proc. of the Third International Workshop on QoS in Multiservice IP Networks, Catania, Italy, Feb. 2005, pp. 353-366.
[6] A. Bianco, R. Birke, G. Botto, M. Chiaberge, J. Finochietto, M. Mellia, F. Neri, M. Petracca, G. Galante, "Boosting the Performance of PC-Based Software Routers with FPGA-enhanced Network Interface Cards", Proc. of IEEE 2006 Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 121-126.
[7] B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router", Proc. of the 2001 USENIX Annual Technical Conference (USENIX '01), Boston, MA, June 2001.
[8] C. Duret, F. Rischette, J. Lattmann, V. Laspreses, P. Van Heuven, S. Van den Berghe, P. Demeester, "High Router Flexibility and Performance by Combining Dedicated Lookup Hardware (IFT), off the Shelf Switches and Linux", Proc. of the Second International IFIP-TC6 Networking Conference, Pisa, Italy, May 2002, LNCS 2345, E. Gregori et al., Eds., Springer-Verlag, 2002, pp. 1117-1122.
[9] A. Barczyk, A. Carbone, J.P. Dufey, D. Galli, B. Jost, U. Marconi, N. Neufeld, G. Peco, V. Vagnoni, "Reliability of datagram transmission on Gigabit Ethernet at full link load", LHCb technical note, LHCb 2004-030 DAQ, March 2004.
[10] Building Open Router Architectures based on Router Aggregation (BORA BORA) project, http://www.telematica.polito.it/projects/borabora/.
[11] R. Bolla, R. Bruschi, "RFC 2544 Performance Evaluation and Internal Measurements for a Linux Based Open Router", Proc. of IEEE 2006 Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 9-14.
[12] R. Bolla, R. Bruschi, "A high-end Linux based Open Router for IP QoS networks: tuning and performance analysis with internal (profiling) and external measurement tools of the packet forwarding capabilities", Proc. of the 3rd International Workshop on Internet Performance, Simulation, Monitoring and Measurements (IPS-MoMe 2005), Warsaw, Poland, Mar. 2005, pp. 203-214.
[13] R. Bolla, R. Bruschi, "IP forwarding Performance Analysis in presence of Control Plane Functionalities in a PC-based Open Router", Proc. of the 2005 Tyrrhenian International Workshop on Digital Communications (TIWDC 2005), Sorrento, Italy, Jun. 2005, and in F. Davoli, S. Palazzo, S. Zappatore, Eds., "Distributed Cooperative Laboratories: Networking, Instrumentation, and Measurements", Springer, Norwell, MA, 2006, pp. 143-158.
[14] J. Fu, O. Hagsand, G. Karlsson, "Performance Evaluation and Cache Behavior of LC-Trie for IP-Address Lookup", Proc. of IEEE 2006 Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 29-36.
[15] K. Wehrle, F. Pählke, H. Ritter, D. Müller, M. Bechler, "The Linux Networking Architecture: Design and Implementation of Network Protocols in the Linux Kernel", Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2004.
[16] J. H. Salim, R. Olsson, A. Kuznetsov, "Beyond Softnet", Proc. of the 5th Annual Linux Showcase & Conference, Oakland, CA, USA, Nov. 2001.
[17] The Netfilter project, http://www.netfilter.org/.
[18] D. R. Morrison, "PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric", Journal of the ACM, vol. 15, no. 4, pp. 514-534, Oct. 1968.
[19] S. Nilsson, G. Karlsson, "IP-address lookup using LC-tries", IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1083-1092, Jun. 1999.
[20] Open Router resources from the TNT Lab., http://www.tnt.dist.unige.it/.
[21] The descriptor recycling patch, ftp://robur.slu.se/pub/Linux/netdevelopment/skb_recycling/.
[22] The Agilent N2X Router Tester, http://advanced.comms.agilent.com/n2x/products/index.htm.
[23] Oprofile, http://oprofile.sourceforge.net/news/.