DOI: 10.1145/2785956.2787496

Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Published: 17 August 2015

Abstract

Can we get the network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling whether an application-perceived latency issue is caused by the network, defining and tracking network service level agreements (SLAs), and troubleshooting the network automatically. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer the above question affirmatively. Pingmesh has been running in Microsoft data centers for more than four years, and it collects tens of terabytes of latency data per day. Pingmesh is widely used not only by network software developers and engineers, but also by application and service developers and operators.

Supplementary Material

WEBM File (p139-guo.webm)

Published In

SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
August 2015
684 pages
ISBN: 9781450335423
DOI: 10.1145/2785956
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data center networking
  2. network troubleshooting
  3. silent packet drops

Qualifiers

  • Research-article

Conference

SIGCOMM '15: ACM SIGCOMM 2015 Conference
August 17 - 21, 2015
London, United Kingdom

Acceptance Rates

SIGCOMM '15 Paper Acceptance Rate 40 of 242 submissions, 17%;
Overall Acceptance Rate 462 of 3,389 submissions, 14%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 940
  • Downloads (Last 6 weeks): 106
Reflects downloads up to 01 Mar 2025

Cited By

  • (2025) Dynamic Service Placement in Edge Computing: A Comparative Evaluation of Nature-Inspired Algorithms. IEEE Access 13, 2653-2670. DOI: 10.1109/ACCESS.2024.3520701. Online publication date: 2025
  • (2025) Probe-Optimizer: Discovering important nodes for proactive in-band network telemetry to achieve better probe orchestration. Computer Networks 257, 110935. DOI: 10.1016/j.comnet.2024.110935. Online publication date: Feb-2025
  • (2024) Diagnosing application-network anomalies for millions of IPs in production clouds. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 885-899. DOI: 10.5555/3691992.3692046. Online publication date: 10-Jul-2024
  • (2024) MSFRD. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 869-884. DOI: 10.5555/3691992.3692045. Online publication date: 10-Jul-2024
  • (2024) NetAssistant. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2011-2023. DOI: 10.5555/3691825.3691935. Online publication date: 16-Apr-2024
  • (2024) Towards domain-specific network transport for distributed DNN training. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 1421-1443. DOI: 10.5555/3691825.3691904. Online publication date: 16-Apr-2024
  • (2024) Crescent. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 1045-1062. DOI: 10.5555/3691825.3691883. Online publication date: 16-Apr-2024
  • (2024) Horus. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 1-22. DOI: 10.5555/3691825.3691826. Online publication date: 16-Apr-2024
  • (2024) Toward Global Latency Transparency. 2024 IFIP Networking Conference (IFIP Networking), 536-542. DOI: 10.23919/IFIPNetworking62109.2024.10619858. Online publication date: 3-Jun-2024
  • (2024) F3: Fast and Flexible Network Telemetry with an FPGA coprocessor. Proceedings of the ACM on Networking 2:CoNEXT4, 1-22. DOI: 10.1145/3696397. Online publication date: 25-Nov-2024
