In-Band Network Telemetry (INT) is a novel framework for collecting telemetry items and switch internal state information from the data plane at line rate. With the support of programmable data planes and the P4 programming language, switches parse telemetry instruction headers and determine which telemetry items to attach using custom metadata. At the network edge, telemetry information is removed and the original packets are forwarded, while telemetry reports are sent to a distributed stream processor for further processing by a network monitoring platform. To avoid excessive load on the stream processor, telemetry items should not be sent for each individual packet but only when certain events are triggered. In this paper, we develop a programmable INT event detection mechanism in P4 that allows the control plane to customize, on a per-flow basis, which events are reported to the monitoring system. At the stream processor, we implement a fast INT report collector using the kernel bypass technique AF_XDP, which parses telemetry reports and streams them to a distributed Kafka cluster, where machine learning, visualization, and further monitoring tasks can be applied. In our evaluation, we use real-world traces from different data center workloads and show that our approach is highly scalable and significantly reduces the network overhead and stream processor load due to effective event pre-filtering inside the switch data plane. While the INT report collector can process around 3 Mpps of telemetry reports per core, event pre-filtering increases the capacity by 10-15x.

I. INTRODUCTION

Operations, Administration, and Management (OAM) refers to protocols, tools, and mechanisms that help network operators with fault indication, performance monitoring, security management, diagnostic functions, accounting, configuration, and service provisioning.
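As a rough illustration of the per-flow event pre-filtering summarized in the abstract, the following Python sketch models the decision a switch might make before emitting a telemetry report. It is not the P4 implementation from this paper: the names `FlowRule`, `EventDetector`, `install_rule`, and `should_report`, the choice of hop latency as the monitored item, and the "report when the value changes by at least a threshold" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FlowRule:
    """Hypothetical per-flow rule, as the control plane might install it."""
    latency_delta_us: int  # report when hop latency moved by at least this much


@dataclass
class EventDetector:
    """Toy model of data plane state: one rule and one stored value per flow."""
    rules: dict = field(default_factory=dict)          # flow_id -> FlowRule
    last_reported: dict = field(default_factory=dict)  # flow_id -> last reported latency

    def install_rule(self, flow_id, rule):
        # Stands in for a control plane table write (assumed interface).
        self.rules[flow_id] = rule

    def should_report(self, flow_id, hop_latency_us):
        rule = self.rules.get(flow_id)
        if rule is None:
            # Assumption: flows without a rule never generate reports.
            return False
        previous = self.last_reported.get(flow_id)
        # Report the first sample, then only changes that reach the threshold.
        if previous is None or abs(hop_latency_us - previous) >= rule.latency_delta_us:
            self.last_reported[flow_id] = hop_latency_us
            return True
        return False


detector = EventDetector()
detector.install_rule("flow-a", FlowRule(latency_delta_us=50))

samples = [100, 110, 120, 180, 185, 100]  # made-up hop latencies in microseconds
reports = [s for s in samples if detector.should_report("flow-a", s)]
print(reports)  # [100, 180, 100]
```

In this toy run only three of the six samples would reach the stream processor, which is the kind of load reduction that event pre-filtering aims for. The comparison is made against the last reported value rather than the last seen value, so slow drifts still trigger a report eventually; this is a design choice of the sketch, not a statement about the paper's mechanism.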
In traditional carrier networks, OAM tools such as SNMP and OWAMP-Test are used; however, these tools have proven inadequate for SDN-NFV data centers. They are not scalable and cannot provide fine-grained, real-time information about the overall performance of the data center infrastructure [1]. In-Band Network Telemetry (INT) has gained a lot of momentum over the last few years [1]-[5]. The idea behind the INT framework is that each node along a network path adds telemetry items and network state to in-band, data plane traffic. Telemetry items may include switch ID, ingress timestamps, queue occupancy information, and various other performance-related metadata, which are added at line rate as customized headers to in-band, data plane packets. The telemetry items are forwarded to a distributed network monitoring platform, which