research-article

Application and Thermal-reliability-aware Reinforcement Learning Based Multi-core Power Management

Published: 10 October 2019

Abstract

Power management through dynamic voltage and frequency scaling (DVFS) is one of the most widely adopted techniques. However, it impacts application reliability (due to soft errors, circuit aging, and deadline misses). Moreover, increased power density impacts the thermal reliability of the chip, sometimes leading to permanent failure. To balance application and thermal reliability while achieving power savings and maintaining performance, we propose application- and thermal-reliability-aware reinforcement learning–based multi-core power management in this work. The proposed power management scheme employs a reinforcement learner that considers both the power savings and the variations in application and thermal reliability caused by DVFS. To reduce the computational overhead, power management decisions are made at application-level granularity rather than at per-core or system-level granularity. The proposed multi-core power management was evaluated experimentally on a microprocessor with up to 32 cores running PARSEC applications to demonstrate the applicability and efficiency of the technique. Compared to existing state-of-the-art techniques, the proposed technique enables average energy savings of up to ∼20% and a temperature reduction of up to 4.926°C, without degrading application or thermal reliability.
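The abstract does not spell out the learner's states, actions, or reward, so the following is only a minimal sketch of the general idea: a tabular Q-learning agent that picks one voltage/frequency setting for a whole application and is rewarded for low energy while being penalized for thermal violations and deadline misses. Everything in it is an assumption made for illustration (the frequency levels, the two-state thermal abstraction, the toy energy and temperature formulas, the penalty weights, and the function names `step`, `train`, and `greedy_policy`); it is not the authors' algorithm.

```python
import random

# All values below are illustrative assumptions, not figures from the paper.
FREQ_GHZ = [1.0, 1.5, 2.0, 2.5]   # hypothetical application-level V/f settings
STATES = ["cool", "hot"]          # hypothetical coarse thermal state
T_LIMIT_C = 65.0                  # hypothetical thermal-reliability threshold
MIN_FREQ_FOR_DEADLINE = 1.5       # hypothetical: below this, deadlines are missed
PENALTY = 10.0                    # hypothetical weight for each violation


def step(freq):
    """Toy environment: return (reward, next_state) for running at `freq`."""
    energy = freq ** 2                    # crude stand-in for energy cost
    temperature = 40.0 + 12.0 * freq      # crude stand-in for a thermal model
    reward = -energy
    if temperature > T_LIMIT_C:           # thermal-reliability violation
        reward -= PENALTY
    if freq < MIN_FREQ_FOR_DEADLINE:      # application-reliability violation
        reward -= PENALTY
    next_state = 1 if temperature > T_LIMIT_C else 0
    return reward, next_state


def train(steps=20000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * len(FREQ_GHZ) for _ in STATES]
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(FREQ_GHZ))
        else:
            action = max(range(len(FREQ_GHZ)), key=lambda a: q[state][a])
        reward, next_state = step(FREQ_GHZ[action])
        # Standard Q-learning update toward the bootstrapped target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Map each thermal state to the frequency with the highest Q-value."""
    return {
        name: FREQ_GHZ[max(range(len(FREQ_GHZ)), key=lambda a: q[i][a])]
        for i, name in enumerate(STATES)
    }
```

Calling `greedy_policy(train())` in this toy setup should settle on 1.5 GHz in both states, since that is the lowest-energy setting that avoids both penalties. Choosing a single setting per application, rather than per core, keeps the action table small, which is in the spirit of the overhead argument made in the abstract.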

Information

Published In

ACM Journal on Emerging Technologies in Computing Systems  Volume 15, Issue 4
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
October 2019
226 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3365594
  • Editor:
  • Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2019
Accepted: 01 March 2019
Revised: 01 December 2018
Received: 01 July 2018
Published in JETC Volume 15, Issue 4

Author Tags

  1. DVFS
  2. Multi-core processor
  3. application reliability
  4. power management
  5. reinforcement learning
  6. thermal reliability

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • German Research Foundation (DFG)
