DOI: 10.1145/3603269.3604827

Direct Telemetry Access

Published: 01 September 2023

Abstract

Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As networks grow and telemetry becomes more fine-grained, the amount of data that switches must report to collectors to enable a network-wide view grows tremendously. As a consequence, it is increasingly hard to scale data collection systems.
We introduce Direct Telemetry Access (DTA), a solution optimized for aggregating and moving hundreds of millions of reports per second from switches into queryable data structures in collectors' memory. DTA is lightweight and greatly reduces overheads at collectors. It is built on top of RDMA, and we propose novel and expressive reporting primitives that allow easy integration with existing state-of-the-art telemetry mechanisms such as INT or Marple.
We show that DTA significantly improves telemetry collection rates. For example, when used with INT, it can collect and aggregate over 400M reports per second with a single server, improving over the Atomic MultiLog by up to 16x.
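
The idea of writing reports straight into a queryable structure in collector memory can be illustrated with a small simulation. This is only a sketch under stated assumptions: the abstract does not specify DTA's data structures or primitives, so the hash-indexed slot table, the redundancy factor, and every name below (`CollectorMemory`, `remote_write`, `report`, `query`) are hypothetical and chosen for illustration. A plain Python list stands in for an RDMA-registered memory region.

```python
import zlib

SLOT_COUNT = 1024  # hypothetical table size, not taken from the paper


def slot_index(key: bytes, replica: int) -> int:
    """Map a key to a slot; the replica number salts the hash."""
    return zlib.crc32(bytes([replica]) + key) % SLOT_COUNT


def key_tag(key: bytes) -> int:
    """Short tag stored beside the value so a reader can detect overwrites."""
    return zlib.crc32(b"tag" + key) & 0xFFFF


class CollectorMemory:
    """Stands in for a memory region that a remote writer addresses directly."""

    def __init__(self) -> None:
        self.slots = [None] * SLOT_COUNT  # each slot is None or (tag, value)

    def remote_write(self, index: int, tag: int, value: int) -> None:
        # Models a one-sided write: no collector-side logic runs here.
        self.slots[index] = (tag, value)


def report(mem: CollectorMemory, key: bytes, value: int, replicas: int = 2) -> None:
    """Writer side (assumed): store the value at several hashed locations."""
    for r in range(replicas):
        mem.remote_write(slot_index(key, r), key_tag(key), value)


def query(mem: CollectorMemory, key: bytes, replicas: int = 2):
    """Reader side (assumed): return the first slot whose tag matches the key."""
    tag = key_tag(key)
    for r in range(replicas):
        entry = mem.slots[slot_index(key, r)]
        if entry is not None and entry[0] == tag:
            return entry[1]
    return None  # never reported, or every replica was overwritten


mem = CollectorMemory()
report(mem, b"flow-1", 42)
latest = query(mem, b"flow-1")
```

In this sketch the writer does all the placement work, so the collector only pays for a few memory reads when it queries. Storing each value in more than one slot trades memory for resilience against overwrites by other keys.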

Published In

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
September 2023
1217 pages
ISBN:9798400702365
DOI:10.1145/3603269
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. telemetry collection
  2. monitoring
  3. remote direct memory access

Qualifiers

  • Research-article

Conference

ACM SIGCOMM '23: ACM SIGCOMM 2023 Conference
September 10, 2023
New York, NY, USA

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
