DOI: 10.1145/3341302.3342069
research-article
Open access

TEAVAR: striking the right utilization-availability balance in WAN traffic engineering

Published: 19 August 2019

Abstract

To keep up with the continuous growth in demand, cloud providers spend millions of dollars augmenting the capacity of their wide-area backbones and devote significant effort to efficiently utilizing WAN capacity. A key challenge is striking a good balance between network utilization and availability, as these are inherently at odds; a highly utilized network might not be able to withstand unexpected traffic shifts resulting from link/node failures. We advocate a novel approach to this challenge that draws inspiration from financial risk theory: leverage empirical data to generate a probabilistic model of network failures and maximize bandwidth allocation to network users subject to an operator-specified availability target. Our approach enables network operators to strike the utilization-availability balance that best suits their goals and operational reality. We present TEAVAR (Traffic Engineering Applying Value at Risk), a system that realizes this risk management approach to traffic engineering (TE). We compare TEAVAR to state-of-the-art TE solutions through extensive simulations across many network topologies, failure scenarios, and traffic patterns, including benchmarks extrapolated from Microsoft's WAN. Our results show that with TEAVAR, operators can support up to twice as much throughput as state-of-the-art TE schemes, at the same level of availability.
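The risk measure the abstract borrows from finance can be illustrated on a toy failure model. The sketch below is not TEAVAR's optimization; it only evaluates, for a fixed traffic split, the Value at Risk (VaR) and Conditional Value at Risk (CVaR) of the fraction of demand lost. It assumes independent link failures, a single demand split across parallel links, and a loss equal to the share of traffic placed on failed links. All function names and the example probabilities are hypothetical.

```python
from itertools import product


def scenario_distribution(fail_probs):
    """Enumerate every up/down link state with its probability.

    Assumes links fail independently (an illustrative simplification).
    """
    scenarios = []
    for state in product([False, True], repeat=len(fail_probs)):
        p = 1.0
        for failed, q in zip(state, fail_probs):
            p *= q if failed else (1.0 - q)
        scenarios.append((state, p))
    return scenarios


def loss_distribution(split, fail_probs):
    """Map each scenario to the fraction of demand routed over failed links."""
    result = []
    for state, p in scenario_distribution(fail_probs):
        loss = sum((share for share, failed in zip(split, state) if failed), 0.0)
        result.append((loss, p))
    return result


def value_at_risk(losses_probs, beta):
    """Smallest loss v such that P(loss <= v) >= beta."""
    ordered = sorted(losses_probs)
    cumulative = 0.0
    for loss, p in ordered:
        cumulative += p
        if cumulative >= beta - 1e-12:  # small tolerance for float rounding
            return loss
    return ordered[-1][0]


def conditional_value_at_risk(losses_probs, beta):
    """CVaR via the Rockafellar-Uryasev expression, evaluated at alpha = VaR:
    alpha + E[max(loss - alpha, 0)] / (1 - beta).
    """
    alpha = value_at_risk(losses_probs, beta)
    excess = sum(p * max(loss - alpha, 0.0) for loss, p in losses_probs)
    return alpha + excess / (1.0 - beta)


if __name__ == "__main__":
    # Hypothetical example: demand split evenly over two links that fail
    # with probability 1% and 2% respectively.
    dist = loss_distribution([0.5, 0.5], [0.01, 0.02])
    for beta in (0.95, 0.99):
        print(beta, value_at_risk(dist, beta), conditional_value_at_risk(dist, beta))
```

In this toy setting, raising the availability target `beta` forces the operator to account for rarer failure scenarios, which is the utilization-availability trade-off the abstract describes. A TE scheme in this spirit would choose the split to minimize such a risk measure rather than merely evaluate it.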

Supplementary Material

MP4 File (p29-bogle.mp4)



        Published In

        SIGCOMM '19: Proceedings of the ACM Special Interest Group on Data Communication
        August 2019
        526 pages
        ISBN:9781450359566
        DOI:10.1145/3341302
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. availability
        2. network optimization
        3. traffic engineering
        4. utilization

        Qualifiers

        • Research-article

        Conference

        SIGCOMM '19: ACM SIGCOMM 2019 Conference
        August 19 - 23, 2019
        Beijing, China

        Acceptance Rates

        Overall Acceptance Rate 462 of 3,389 submissions, 14%


