research-article
Open access

Dynamic Colocation Policies with Reinforcement Learning

Published: 04 March 2020

    Abstract

    We draw on reinforcement learning frameworks to design and implement an adaptive controller for managing resource contention. During runtime, the controller observes the dynamic system conditions and optimizes control policies that satisfy latency targets yet improve server utilization. We evaluate a physical prototype that guarantees 95th percentile latencies for a search engine and improves server utilization by up to 70%, compared to exclusively reserving servers for interactive services, for varied batch workloads in machine learning.
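The control loop described in the abstract (observe conditions, pick an allocation, learn from the resulting latency and utilization) can be illustrated with a minimal sketch. The abstract does not name a specific algorithm, so tabular Q-learning with epsilon-greedy exploration is swapped in here purely as an illustration. The latency model, the 100 ms target, the 8-core limit, the reward shape, and every function name below are hypothetical assumptions, not details from the paper.

```python
import random

# Hypothetical toy model: none of these names or numbers come from the paper.
MAX_BATCH_CORES = 8          # cores that may be lent to batch jobs (assumed)
LATENCY_TARGET_MS = 100.0    # assumed 95th-percentile latency target
ACTIONS = (-1, 0, 1)         # shrink, hold, or grow the batch allocation


def simulated_p95_latency(batch_cores, rng):
    """Stand-in for a real measurement: latency grows with contention."""
    return 60.0 + 8.0 * batch_cores + rng.uniform(-4.0, 4.0)


def reward(batch_cores, latency_ms):
    """Reward utilization, but penalize any latency-target violation."""
    if latency_ms > LATENCY_TARGET_MS:
        return -10.0
    return batch_cores / MAX_BATCH_CORES


def clamp(cores):
    """Keep the allocation inside the valid range."""
    return min(MAX_BATCH_CORES, max(0, cores))


def train(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; the state is the current batch-core count."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(MAX_BATCH_CORES + 1)]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy: mostly exploit, occasionally explore.
        if rng.random() < epsilon:
            a = rng.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        next_state = clamp(state + ACTIONS[a])
        r = reward(next_state, simulated_p95_latency(next_state, rng))
        # Standard Q-learning update toward the bootstrapped target.
        q[state][a] += alpha * (r + gamma * max(q[next_state]) - q[state][a])
        state = next_state
    return q


def greedy_allocation(q, start=0, steps=50):
    """Follow the learned greedy policy and report where it ends up."""
    state = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state = clamp(state + ACTIONS[a])
    return state


if __name__ == "__main__":
    q_table = train()
    print("batch cores chosen by greedy policy:", greedy_allocation(q_table))
```

In this toy model, four batch cores is the largest allocation whose simulated latency always stays under the assumed target; five cores can exceed it. A real controller would replace `simulated_p95_latency` with live measurements and would likely need a richer state than a single core count.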

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 17, Issue 1
    March 2020
    206 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3386454

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2020
    Accepted: 01 December 2019
    Revised: 01 October 2019
    Received: 01 May 2019
    Published in TACO Volume 17, Issue 1

    Author Tags

    1. resource contention
    2. adaptive control
    3. machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed
