research-article

Open access

Dynamic Colocation Policies with Reinforcement Learning

Authors:

Benjamin C. Lee

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 1

Article No.: 1, Pages 1 - 25

https://doi.org/10.1145/3375714

Published: 04 March 2020

Abstract

We draw on reinforcement learning frameworks to design and implement an adaptive controller for managing resource contention. At runtime, the controller observes dynamic system conditions and optimizes control policies that satisfy latency targets while improving server utilization. We evaluate a physical prototype that guarantees 95th percentile latencies for a search engine and, across varied machine learning batch workloads, improves server utilization by up to 70% compared to reserving servers exclusively for interactive services.
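
The control loop described in the abstract (observe system conditions, act, and learn a policy that meets a latency target while raising utilization) can be illustrated with a minimal tabular Q-learning sketch. This is not the paper's controller. The state (a two-level query load), the action space (cores granted to batch work), the 100 ms target, the reward, and the `simulate_p95` latency model are all hypothetical stand-ins chosen for illustration.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
ACTIONS = [0, 1, 2, 3]        # cores granted to batch work (hypothetical)
LOADS = [0, 1]                # observed query load: 0 = low, 1 = high (hypothetical)
LATENCY_TARGET_MS = 100.0     # assumed 95th-percentile latency target


def simulate_p95(load, batch_cores, rng):
    """Toy stand-in for a real system: load and batch cores both add contention."""
    return 40.0 + 35.0 * load + 15.0 * batch_cores + rng.uniform(-3.0, 3.0)


def reward(p95_ms, batch_cores):
    """Penalize a missed target; otherwise reward cores given to batch work."""
    if p95_ms > LATENCY_TARGET_MS:
        return -10.0
    return float(batch_cores)


def train(steps=20000, alpha=0.1, gamma=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in LOADS for a in ACTIONS}
    load = 0
    for _ in range(steps):
        # Epsilon-greedy: explore occasionally, otherwise take the best known action.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(load, a)])
        p95 = simulate_p95(load, action, rng)
        next_load = rng.choice(LOADS)  # load changes independently of the action
        best_next = max(q[(next_load, a)] for a in ACTIONS)
        # Standard Q-learning temporal-difference update.
        td_target = reward(p95, action) + gamma * best_next
        q[(load, action)] += alpha * (td_target - q[(load, action)])
        load = next_load
    return q


def greedy_action(q, load):
    """Return the batch-core allocation with the highest learned value."""
    return max(ACTIONS, key=lambda a: q[(load, a)])


if __name__ == "__main__":
    q = train()
    for load in LOADS:
        print("load", load, "-> batch cores", greedy_action(q, load))
```

In this toy model the learned policy should grant more cores to batch work under low load than under high load, because larger allocations at high load push the simulated tail latency past the target. A real controller would replace `simulate_p95` with measured latencies and would need a richer state and action space than this sketch assumes.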


Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 1

March 2020

206 pages

ISSN: 1544-3566

EISSN: 1544-3973

DOI: 10.1145/3386454

Editor:
Koen De Bosschere
Ghent University, Belgium

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2020

Accepted: 01 December 2019

Revised: 01 October 2019

Received: 01 May 2019


Qualifiers

Research-article
Research
Refereed

Article Metrics

Total Citations: 11
Total Downloads: 1,312
Downloads (Last 12 months): 170
Downloads (Last 6 weeks): 20

Reflects downloads up to 26 Jul 2024

