research-article

Transferable Neural WAN TE for Changing Topologies

Authors: Abd AlRhman AlQiam, Yuanjun Yao, Zhaodong Wang, Satyajeet Singh Ahuja, Ying Zhang, Sanjay G. Rao, Bruno Ribeiro, Mohit Tawarmalani

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

Pages 86 - 102

https://doi.org/10.1145/3651890.3672237

Published: 04 August 2024

Abstract

Recently, researchers have proposed ML-driven traffic engineering (TE) schemes in which a neural network model produces TE decisions in lieu of conventional optimization solvers. Unfortunately, existing ML-based TE schemes are not explicitly designed to be robust to topology changes that may occur due to WAN evolution, failures, or planned maintenance. In this paper, we present HARP, a neural model for TE explicitly capable of handling variations in topology, including those not observed in training. HARP is designed with two principles in mind: (i) ensure invariance to natural input transformations (e.g., permutations of node ids, tunnel reordering); and (ii) align the neural architecture with the optimization model. Evaluations on a multi-week dataset of a large private WAN show that HARP achieves a maximum link utilization (MLU) at most 11% higher than optimal over 98% of the time, despite encountering significantly different topologies in testing relative to the training data. Further, comparisons with state-of-the-art ML-based TE schemes indicate the importance of the mechanisms introduced by HARP to handle topology variability. Finally, when predicted traffic matrices are provided, HARP outperforms classic optimization solvers, achieving a median reduction in MLU of 5 to 10% on the true traffic matrix.
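
To make the MLU (maximum link utilization) metric and the tunnel-reordering invariance mentioned in the abstract concrete, the following is a minimal sketch in Python. It is not HARP's model or the authors' code: the three-node topology, the capacities, the demand, the split ratios, and the function name `max_link_utilization` are all hypothetical and chosen only for illustration. It computes the MLU induced by a set of tunnel split ratios and checks that the value is unchanged when the tunnels of a node pair are listed in a different order.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]      # directed link (source node, destination node)
Pair = Tuple[str, str]      # (ingress node, egress node) of a demand
Tunnel = List[Link]         # a tunnel is a sequence of links


def max_link_utilization(
    capacities: Dict[Link, float],
    tunnels: Dict[Pair, List[Tunnel]],
    split_ratios: Dict[Pair, List[float]],
    demands: Dict[Pair, float],
) -> float:
    """Return the largest load/capacity ratio over all links."""
    load = {link: 0.0 for link in capacities}
    for pair, demand in demands.items():
        # Each tunnel carries its share of the pair's demand on every link it uses.
        for tunnel, ratio in zip(tunnels[pair], split_ratios[pair]):
            for link in tunnel:
                load[link] += demand * ratio
    return max(load[link] / capacities[link] for link in capacities)


# Hypothetical topology: one direct link a->c and a two-hop path a->b->c.
capacities = {("a", "b"): 10.0, ("b", "c"): 10.0, ("a", "c"): 5.0}
tunnels = {("a", "c"): [[("a", "c")], [("a", "b"), ("b", "c")]]}
split_ratios = {("a", "c"): [0.25, 0.75]}
demands = {("a", "c"): 8.0}

mlu = max_link_utilization(capacities, tunnels, split_ratios, demands)

# Reorder the tunnels (and their split ratios consistently): the MLU is the same.
tunnels_reordered = {pair: list(reversed(ts)) for pair, ts in tunnels.items()}
ratios_reordered = {pair: list(reversed(rs)) for pair, rs in split_ratios.items()}
mlu_reordered = max_link_utilization(
    capacities, tunnels_reordered, ratios_reordered, demands
)

print(mlu, mlu_reordered)
```

The check only shows that the metric itself does not depend on how tunnels are ordered. A model meant to respect this property would, presumably, need to produce split ratios that permute consistently with its inputs; how HARP achieves that is described in the paper, not here.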

Index Terms

Transferable Neural WAN TE for Changing Topologies
1. Computing methodologies
  1. Machine learning
2. Networks
  1. Network algorithms
    1. Control path algorithms
      1. Traffic engineering algorithms

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

August 2024

1033 pages

ISBN: 9798400706141

DOI: 10.1145/3651890

Co-chairs: Aruna Seneviratne, Darryl Veitch

Program Co-chairs: Vyas Sekar, Minlan Yu

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference

Sponsor: SIGCOMM

August 4 - 8, 2024

Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
