DOI: 10.1145/3545008.3545074
research-article
Open access

NCC: Neighbor-aware Congestion Control based on Reinforcement Learning for Datacenter Networks

Published: 13 January 2023

Abstract

Low-latency, high-throughput datacenter networks create new traffic management problems that require new congestion control mechanisms. Proposals to solve these problems have generally focused either on refining existing window-based congestion control, as in TCP, or on introducing a central controller to make congestion control decisions. In this paper, we propose a third approach, in which nodes share network information with their neighbors and use this information to make local decisions that limit global congestion. In our implementation, the rate-limiting decisions on each node are driven by a local agent that uses reinforcement learning to optimize a combination of overall latency, throughput, and the shared information. To make this approach efficient, the local agents choose an overall rate limit for each node, and a separate process then assigns the traffic of individual flows within these limits. We show that, in a trace-driven real implementation, our method achieves better congestion avoidance than several end-to-end and centralized mechanisms from prior work.
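
The two-level structure described in the abstract (a per-node rate limit chosen by a local agent, then a separate per-flow assignment within that limit) can be illustrated with a minimal sketch. The abstract does not give the reward formula or the flow assignment rule, so everything below is an assumption made for illustration: the linear reward with weights w_tput, w_lat and w_nbr, the use of mean neighbor congestion as the shared signal, and the max-min fair (water-filling) split are hypothetical choices, not the authors' method.

```python
from typing import Dict, List


def reward(throughput: float, latency: float,
           neighbor_congestion: List[float],
           w_tput: float = 1.0, w_lat: float = 1.0,
           w_nbr: float = 1.0) -> float:
    """Hypothetical scalar reward for a node's local agent.

    Assumption: throughput is rewarded, while latency and the mean
    congestion level reported by neighbors are penalized.
    """
    if neighbor_congestion:
        nbr = sum(neighbor_congestion) / len(neighbor_congestion)
    else:
        nbr = 0.0
    return w_tput * throughput - w_lat * latency - w_nbr * nbr


def assign_flows(node_limit: float,
                 demands: Dict[str, float]) -> Dict[str, float]:
    """Split a node-level rate limit across flows.

    Assumption: max-min fairness via water-filling. Flows that demand
    less than an equal share get their full demand; the rest split
    what remains equally.
    """
    alloc = {flow: 0.0 for flow in demands}
    remaining = node_limit
    active = sorted(demands, key=demands.get)  # smallest demand first
    while active and remaining > 1e-12:
        share = remaining / len(active)
        smallest = active[0]
        if demands[smallest] <= share:
            # This flow is fully satisfied; release the leftover.
            alloc[smallest] = demands[smallest]
            remaining -= demands[smallest]
            active.pop(0)
        else:
            # No remaining flow fits in an equal share: cap them all.
            for flow in active:
                alloc[flow] = share
            remaining = 0.0
            active = []
    return alloc
```

For example, with a node limit of 10 and demands of 2, 8 and 8, this sketch gives the small flow its full demand and splits the remaining 8 equally between the two large flows. Separating the two levels keeps the learning agent's action space to one value per node, regardless of how many flows the node carries.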


Cited By

  • (2024) Comprehensive review on congestion detection, alleviation, and control for IoT networks. Journal of Network and Computer Applications 221:C. DOI: 10.1016/j.jnca.2023.103749. Online publication date: 14-Mar-2024

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN:9781450397339
DOI:10.1145/3545008
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Congestion control
  2. Datacenter network
  3. Reinforcement learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Microsoft Research Fellowship
  • CCF
  • NSF

Conference

ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Bibliometrics

  • Downloads (last 12 months): 178
  • Downloads (last 6 weeks): 14
Reflects downloads up to 18 Aug 2024
