DOI: 10.1145/3517745.3561447
research-article

Cross-layer diagnosis of optical backbone failures

Published: 25 October 2022

Abstract

Optical backbone networks, the physical infrastructure interconnecting data centers, are the cornerstones of Wide-Area Network (WAN) connectivity and resilience. Yet, there is limited research on failure characteristics and diagnosis in large-scale operational optical networks. This paper fills the gap by presenting a comprehensive analysis and modeling of optical network failures from a production optical backbone consisting of hundreds of sites and thousands of optical devices. Subsequently, we present a diagnosis system for optical backbone failures, consisting of a multi-level dependency graph and a root-cause inference algorithm across the IP and optical layers. Further, we share our experiences of operating this system for six years and introduce three methods to make the outcome actionable in practice. With empirical evaluation, we demonstrate its high accuracy of 96% and a ticket reduction of 95% for our optical backbone.
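The abstract names a dependency graph and a root-cause inference step but does not spell out either one. As a rough illustration only, the sketch below flattens the graph to two levels (IP links on top of optical components) and uses a plain greedy set cover to pick the components that best explain a set of failed IP links. The topology, the names `DEPENDS_ON`, `build_risk_groups`, and `infer_root_causes`, and the choice of greedy set cover are all assumptions made for this example; they are not taken from the paper and should not be read as its algorithm.

```python
from collections import defaultdict

# Hypothetical topology (an assumption for illustration): each IP link maps to
# the optical-layer components it rides on.
DEPENDS_ON = {
    "ip-link-A": ["wavelength-1", "fiber-span-X"],
    "ip-link-B": ["wavelength-2", "fiber-span-X"],
    "ip-link-C": ["wavelength-3", "fiber-span-Y"],
}


def build_risk_groups(depends_on):
    """Invert the dependency map: component -> set of IP links that rely on it."""
    groups = defaultdict(set)
    for ip_link, components in depends_on.items():
        for component in components:
            groups[component].add(ip_link)
    return dict(groups)


def infer_root_causes(depends_on, failed_links):
    """Greedy set cover over optical components.

    A component is a candidate only if every IP link that depends on it has
    failed. The loop repeatedly picks the candidate that explains the most
    still-unexplained failures. Returns (causes, unexplained_links).
    """
    groups = build_risk_groups(depends_on)
    failed = set(failed_links)
    candidates = {c: links for c, links in groups.items() if links <= failed}
    unexplained = set(failed)
    causes = []
    while unexplained:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(candidates),
            key=lambda c: len(candidates[c] & unexplained),
            default=None,
        )
        if best is None or not candidates[best] & unexplained:
            break  # nothing left that explains the remaining failures
        causes.append(best)
        unexplained -= candidates[best]
    return causes, unexplained


if __name__ == "__main__":
    # Both links sharing fiber-span-X are down, so the shared span is blamed.
    print(infer_root_causes(DEPENDS_ON, ["ip-link-A", "ip-link-B"]))
    # Only ip-link-A is down, so the span (still carrying B) is ruled out.
    print(infer_root_causes(DEPENDS_ON, ["ip-link-A"]))
```

In this toy setup, one shared-component diagnosis replaces two per-link alarms, which is the general idea behind grouping symptoms before opening tickets. A production system would need more levels, timing information, and tolerance for noisy or missing alarms.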

Supplementary Material

M4V File (266.m4v)
Presentation video

Published In

IMC '22: Proceedings of the 22nd ACM Internet Measurement Conference
October 2022
796 pages
ISBN: 9781450392594
DOI: 10.1145/3517745
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

  • USENIX Association

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

IMC '22
IMC '22: ACM Internet Measurement Conference
October 25 - 27, 2022
Nice, France

Acceptance Rates

Overall Acceptance Rate 277 of 1,083 submissions, 26%
