research-article

Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Published: 17 August 2015

Abstract

Can we get network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling whether an application-perceived latency issue is caused by the network, defining and tracking network service level agreements (SLAs), and automatic network troubleshooting. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer the above question affirmatively. Pingmesh has been running in Microsoft data centers for more than four years, and it collects tens of terabytes of latency data per day. Pingmesh is widely used not only by network software developers and engineers, but also by application and service developers and operators.

Supplementary Material

WEBM File (p139-guo.webm)

Published In

ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4
SIGCOMM'15
October 2015
659 pages
ISSN: 0146-4833
DOI: 10.1145/2829988

SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
August 2015
684 pages
ISBN: 9781450335423
DOI: 10.1145/2785956

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2015
Published in SIGCOMM-CCR Volume 45, Issue 4

Author Tags

  1. data center networking
  2. network troubleshooting
  3. silent packet drops

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 839
  • Downloads (Last 6 weeks): 128
Reflects downloads up to 12 Sep 2024

Citations

Cited By

  • (2024) Design model of a twisted and folded Clos network with multi-step grouped intermediate switches guaranteeing admissible blocking probability. Journal of Optical Communications and Networking 16:3 (328). DOI: 10.1364/JOCN.513898. Online publication date: 21-Feb-2024
  • (2024) INT-Label: Lightweight In-Band Network-Wide Telemetry via Distributed Labeling. IEEE Transactions on Parallel and Distributed Systems 35:5 (751-767). DOI: 10.1109/TPDS.2024.3367933. Online publication date: May-2024
  • (2024) CloudSentry: Two-Stage Heavy Hitter Detection for Cloud-Scale Gateway Overload Protection. IEEE Transactions on Parallel and Distributed Systems 35:4 (616-633). DOI: 10.1109/TPDS.2023.3301852. Online publication date: Apr-2024
  • (2024) SFANT: A SRv6-Based Flexible and Active Network Telemetry Scheme in Programming Data Plane. IEEE Transactions on Network Science and Engineering 11:3 (2415-2425). DOI: 10.1109/TNSE.2023.3277000. Online publication date: May-2024
  • (2024) Proactive Telemetry in Large-Scale Multi-Tenant Cloud Overlay Networks. IEEE/ACM Transactions on Networking 32:4 (3002-3017). DOI: 10.1109/TNET.2024.3381786. Online publication date: Aug-2024
  • (2023) PDLB: Path Diversity-aware Load Balancing with adaptive granularity in data center networks. Journal of Cloud Computing 12:1. DOI: 10.1186/s13677-023-00548-x. Online publication date: 7-Dec-2023
  • (2023) MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry. Proceedings of the 52nd International Conference on Parallel Processing (347-357). DOI: 10.1145/3605573.3605622. Online publication date: 7-Aug-2023
  • (2023) Tripartite Graph Aided Tensor Completion For Sparse Network Measurement. IEEE Transactions on Parallel and Distributed Systems 34:1 (48-62). DOI: 10.1109/TPDS.2022.3213259. Online publication date: 1-Jan-2023
  • (2023) Fast, Scalable and Robust Centralized Routing for Data Center Networks. IEEE/ACM Transactions on Networking 31:6 (2624-2639). DOI: 10.1109/TNET.2023.3259541. Online publication date: Dec-2023
  • (2023) CocoSketch: High-Performance Sketch-Based Measurement Over Arbitrary Partial Key Query. IEEE/ACM Transactions on Networking 31:6 (2653-2668). DOI: 10.1109/TNET.2023.3257226. Online publication date: Dec-2023