
Lancet: Better Network Resilience by Designing for Pruned Failure Sets

Published: 17 December 2019

Abstract

Recently, researchers have started exploring the design of route protection schemes that ensure networks can sustain traffic demand without congestion under failures. Existing approaches focus on ensuring that worst-case performance over all simultaneous f-failure scenarios is acceptable. Unfortunately, even a single bad scenario may render these schemes unable to protect against any f-failure scenario. In this paper, we present Lancet, a system designed to handle most failures when not all can be tackled. Lancet comprises three components: (i) an algorithm to analyze which failure scenarios the network can intrinsically handle, which provides a benchmark for any protection routing scheme and guides the design of new schemes; (ii) an approach to efficiently design protection schemes for more general failure sets than all f-failure scenarios; and (iii) techniques to determine which of combinatorially many scenarios to design for. Our evaluations with real topologies and validations on an emulation testbed show that Lancet outperforms a worst-case approach by protecting against many more scenarios, and can even match the scenarios that can be handled by optimal network response.
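
The gap between a worst-case guarantee and a per-scenario view can be illustrated with a small sketch. This is not Lancet's algorithm. It is a brute-force toy that assumes a single source-sink demand and uses max flow on the surviving links as a stand-in for what the network can intrinsically handle. The topology, capacities, demand, and the names `max_flow`, `classify_scenarios`, and `LINKS` are all hypothetical.

```python
from collections import deque
from itertools import combinations


def max_flow(links, source, sink):
    """Edmonds-Karp max flow over undirected links given as {(u, v): capacity}."""
    residual = {}
    for (u, v), cap in links.items():
        # Model each undirected link as two directed arcs of equal capacity.
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v][u] = residual[v].get(u, 0) + cap
    flow = 0
    # If an endpoint has lost all its links, no flow is possible.
    while source in residual and sink in residual:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    return flow


def classify_scenarios(links, source, sink, demand, f):
    """Split all f-link failure scenarios into handled and unhandled ones."""
    handled, unhandled = [], []
    for failed in combinations(sorted(links), f):
        surviving = {l: c for l, c in links.items() if l not in failed}
        if max_flow(surviving, source, sink) >= demand:
            handled.append(failed)
        else:
            unhandled.append(failed)
    return handled, unhandled


# Hypothetical 4-node topology; every link has capacity 10.
LINKS = {("s", "a"): 10, ("s", "b"): 10, ("a", "t"): 10,
         ("b", "t"): 10, ("a", "b"): 10}

handled, unhandled = classify_scenarios(LINKS, "s", "t", demand=10, f=2)
print(f"{len(handled)} of {len(handled) + len(unhandled)} scenarios handled")
print("unhandled:", unhandled)
```

In this toy, two of the ten 2-link failure scenarios disconnect the source or the sink, so a guarantee over all 2-failure scenarios cannot be met, even though the remaining eight scenarios can each carry the demand. Designing for the eight and setting the two aside is the kind of pruning the abstract describes.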

      Published In

      Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 3, Issue 3
      SIGMETRICS
      December 2019
      525 pages
      EISSN: 2476-1249
      DOI: 10.1145/3376928
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 December 2019
      Published in POMACS Volume 3, Issue 3

      Author Tags

      1. network availability
      2. network optimization
      3. protection routing

      Qualifiers

      • Research-article

      Funding Sources

      • NSF CNS

      Cited By

      • (2024) Transferable Neural WAN TE for Changing Topologies. Proceedings of the ACM SIGCOMM 2024 Conference, pages 86-102. DOI: 10.1145/3651890.3672237. Online publication date: 4-Aug-2024
      • (2024) FERN: Leveraging Graph Attention Networks for Failure Evaluation and Robust Network Design. IEEE/ACM Transactions on Networking, 32:2, pages 1003-1018. DOI: 10.1109/TNET.2023.3311678. Online publication date: Apr-2024
      • (2023) XRON: A Hybrid Elastic Cloud Overlay Network for Video Conferencing at Planetary Scale. Proceedings of the ACM SIGCOMM 2023 Conference, pages 696-709. DOI: 10.1145/3603269.3604845. Online publication date: 10-Sep-2023
      • (2023) Discovery of Flow Splitting Ratios in ISP Networks with Measurement Noise. 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC), pages 64-70. DOI: 10.1109/PRDC59308.2023.00017. Online publication date: 24-Oct-2023
      • (2023) Machine Learning for Robust Network Design: A New Perspective. IEEE Communications Magazine, 61:10, pages 86-92. DOI: 10.1109/MCOM.002.2200670. Online publication date: Oct-2023
      • (2022) Traffic engineering. Proceedings of the Symposium on SDN Research, pages 50-58. DOI: 10.1145/3563647.3563652. Online publication date: 19-Oct-2022
      • (2022) Flexile. Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, pages 110-125. DOI: 10.1145/3555050.3569119. Online publication date: 30-Nov-2022
      • (2022) Meissa. Proceedings of the ACM SIGCOMM 2022 Conference, pages 350-364. DOI: 10.1145/3544216.3544247. Online publication date: 22-Aug-2022
      • (2022) Probability estimation via policy restrictions, convexification, and approximate sampling. Mathematical Programming: Series A and B, 196:1-2, pages 309-345. DOI: 10.1007/s10107-022-01823-6. Online publication date: 1-Nov-2022
      • (2020) Probabilistic Verification of Network Configurations. Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pages 750-764. DOI: 10.1145/3387514.3405900. Online publication date: 30-Jul-2020
