research-article
Open access

m3: Accurate Flow-Level Performance Estimation using Machine Learning

Published: 04 August 2024

Abstract

Data center network operators often need accurate estimates of aggregate network performance. Unfortunately, existing methods for estimating aggregate network statistics are either inaccurate or too slow to be practical at data center scale.
In this paper, we develop and evaluate a scale-free, fast, and accurate model for estimating data center network tail latency performance for a given workload, topology, and network configuration. First, we show that path-level simulations (simulations of traffic that intersects a given path) produce almost the same aggregate statistics as full network-wide packet-level simulations. We use a simple and fast flow-level fluid simulation in a novel way to capture and summarize essential elements of the path workload, including the effect of cross-traffic on flows on that path. We use this coarse simulation as input to a machine-learning model to predict path-level behavior, and run it on a sample of paths to produce accurate network-wide estimates. Our model generalizes over the choice of congestion control (CC) protocol, CC protocol parameters, and routing. Relative to Parsimon, a state-of-the-art system for rapidly estimating aggregate network tail latency, our approach is significantly faster (5.7×), more accurate (45.9% less error), and more robust.
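
As a rough illustration of what a flow-level fluid simulation computes, the sketch below models a single bottleneck link whose capacity is split equally among all active flows, then reports each flow's completion time and slowdown. This is not the m3 implementation: the single-link setup, the equal-share assumption, and the function names `fluid_fct` and `slowdowns` are all hypothetical choices made for this example.

```python
def fluid_fct(flows, capacity):
    """Fluid (equal-share) simulation of one link.

    flows: list of (arrival_time, size) tuples; capacity: size units per second.
    Returns the flow completion time (finish minus arrival) per flow, in input order.
    Assumption for illustration only: one bottleneck link, no packets, no queues.
    """
    order = sorted(range(len(flows)), key=lambda i: flows[i][0])
    remaining = {}                 # active flow index -> size still to send
    fct = [0.0] * len(flows)
    now, nxt = 0.0, 0              # current time, position of next arrival in `order`
    while nxt < len(order) or remaining:
        if not remaining:          # link idle: jump to the next arrival
            i = order[nxt]
            now = max(now, flows[i][0])
            remaining[i] = float(flows[i][1])
            nxt += 1
            continue
        rate = capacity / len(remaining)           # equal share per active flow
        j = min(remaining, key=remaining.get)      # flow that would finish first
        t_finish = now + remaining[j] / rate
        t_arrive = flows[order[nxt]][0] if nxt < len(order) else float("inf")
        t_next = min(t_arrive, t_finish)
        for k in remaining:                        # drain every active flow up to t_next
            remaining[k] -= rate * (t_next - now)
        now = t_next
        if t_arrive < t_finish:                    # next event is an arrival
            i = order[nxt]
            remaining[i] = float(flows[i][1])
            nxt += 1
        else:                                      # next event is a completion
            del remaining[j]
            fct[j] = now - flows[j][0]
    return fct


def slowdowns(flows, capacity):
    """Slowdown = completion time divided by the time the flow would take alone."""
    return [t / (size / capacity)
            for t, (_, size) in zip(fluid_fct(flows, capacity), flows)]
```

For example, `slowdowns([(0.0, 100.0), (0.5, 50.0)], 100.0)` describes a 100-unit flow that is joined halfway through by a 50-unit flow on a 100 units/s link. A tail statistic such as the 99th-percentile slowdown can then be taken over the returned list.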

      Published In

      ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
      August 2024
      1033 pages
      ISBN:9798400706141
      DOI:10.1145/3651890
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. network simulation
      2. data center networks
      3. approximation
      4. machine learning
      5. network modeling

      Qualifiers

      • Research-article

      Funding Sources

      • NSF Career Award
      • DARPA FastNICs program
      • Cisco Research
      • the UW Center for the Future of Cloud Infrastructure

      Conference

      ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
      August 4 - 8, 2024
      Sydney, NSW, Australia

      Acceptance Rates

      Overall Acceptance Rate 462 of 3,389 submissions, 14%

