DOI: 10.1145/3673038.3673144
research-article
Open access

TESLA: Thermally Safe, Load-Aware, and Energy-Efficient Cooling Control System for Data Centers

Published: 12 August 2024

Abstract

The increasing demand for artificial intelligence and cloud computing has led to skyrocketing energy consumption of data centers (DCs). This paper tackles this energy challenge through cooling control system optimization, which aims to ensure thermal safety with minimal cooling energy consumption. Current industry practice relies on human operators, and many data-driven methods have also been proposed. However, human intervention often results in unnecessary energy consumption, particularly in the face of fluctuating server loads, whereas existing data-driven methods struggle to maintain thermal safety in practice. To overcome these issues, we propose TESLA, a thermally safe, load-aware, and energy-efficient cooling control system for data centers. TESLA employs a novel data-driven framework that integrates domain knowledge to predict DC temperature and cooling energy under dynamic server load. Based on these predictions, a Bayesian optimizer (BO) finds the energy-optimal settings for the cooling system at every control step. Besides cooling energy, the BO’s objective also penalizes cooling interruption, which causes rapid temperature rise within the data center and leads to thermal safety violations. We deploy TESLA on a real data-center testbed and show that it achieves on average 10.1% cooling energy saving relative to a fixed cooling system parameter setting and, in contrast to previous data-driven methods, incurs no thermal safety violations.
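The control step described in the abstract (predict temperature and cooling energy for a candidate setting, then let a Bayesian optimizer pick the setting) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything named here is an assumption made for illustration: the two toy predictor functions, the 27 °C limit, the penalty weight, the single supply-air setpoint as the only control variable, and the use of a small pure-Python Gaussian process with a lower-confidence-bound acquisition rule. The cooling-interruption term is left out because the abstract does not state its form.

```python
import math

# Hypothetical stand-ins for learned predictors (not the paper's models).
def predict_energy_kw(setpoint_c, load_pct):
    return 3.0 + 0.02 * load_pct - 0.12 * (setpoint_c - 18.0)

def predict_max_temp_c(setpoint_c, load_pct):
    return setpoint_c + 0.08 * load_pct

TEMP_LIMIT_C = 27.0        # assumed thermal-safety limit
PENALTY_KW_PER_C = 10.0    # assumed weight on predicted violations

def objective(setpoint_c, load_pct):
    """Predicted cooling energy plus a penalty for predicted over-temperature."""
    violation = max(0.0, predict_max_temp_c(setpoint_c, load_pct) - TEMP_LIMIT_C)
    return predict_energy_kw(setpoint_c, load_pct) + PENALTY_KW_PER_C * violation

def rbf(a, b, length=2.0):
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, zs, x_new, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    k_star = [rbf(x_new, a) for a in xs]
    alpha = solve_chol(L, zs)
    v = solve_chol(L, k_star)
    mu = sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mu, math.sqrt(var)

def choose_setpoint(load_pct, candidates, n_evals=8, kappa=2.0):
    """Sequentially evaluate the candidates that minimize the lower confidence bound."""
    xs = [candidates[0], candidates[len(candidates) // 2], candidates[-1]]
    ys = [objective(x, load_pct) for x in xs]
    while len(xs) < n_evals:
        mean = sum(ys) / len(ys)
        std = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys)) or 1.0
        zs = [(y - mean) / std for y in ys]  # standardize for the unit-variance prior

        def lcb(c):
            mu, sigma = gp_posterior(xs, zs, c)
            return mu - kappa * sigma

        nxt = min((c for c in candidates if c not in xs), key=lcb)
        xs.append(nxt)
        ys.append(objective(nxt, load_pct))
    best = min(range(len(xs)), key=ys.__getitem__)
    return xs[best], ys[best]

CANDIDATES = [18.0 + 0.5 * i for i in range(19)]  # 18.0 .. 27.0 °C
setpoint, cost = choose_setpoint(60.0, CANDIDATES)
print(f"chosen setpoint: {setpoint} °C, predicted cost: {cost:.2f} kW")
```

In this sketch the penalty weight is chosen large enough that any predicted violation costs more than the energy it saves, so the returned setpoint is always one whose predicted temperature stays within the assumed limit. A real deployment would re-run such a search at each control step with the current load forecast.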

Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024
1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bayesian optimization
  2. Data centers
  3. cooling control optimization
  4. thermal safety
  5. time-series modeling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
