DOI: 10.1145/3061639.3062301
research-article
Public Access

Phase-driven Learning-based Dynamic Reliability Management For Multi-core Processors

Published: 18 June 2017

Abstract

In this paper, we propose a phase-driven, Q-learning-based dynamic reliability management (DRM) technique for multi-core processors. It targets DRM problems that maximize processor performance subject to a large class of reliability constraints, using two control knobs: turning cores on or off and dynamic voltage and frequency scaling (DVFS). Our technique uses existing methods to detect program phases (e.g., [17]) and learns the optimal configuration of the multi-core processor for each phase, rather than obtaining it at an off-line stage. It outperforms existing learning-based DRM methods in managing programs with highly diverse phases. We evaluate the technique on a DRM problem for 3D CPUs: maximizing processor performance subject to the electromigration-induced power delivery network reliability constraint. Compared to the latest Q-learning-based DRM technique [11], our method achieves more than 1.3x improvement in performance with 77% memory savings.
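
To make the idea of learning one configuration per program phase concrete, the following is a minimal sketch of tabular Q-learning in which the state is a phase label and each action is a (active cores, V/f level) pair. It is not the authors' implementation. Everything in it is an assumption made for illustration: the platform numbers, the `stress` proxy standing in for the reliability constraint, the fixed penalty for violating it, and the alternating "A"/"B" phase sequence that stands in for a real phase detector.

```python
import random
from collections import defaultdict

# Hypothetical platform model (illustrative numbers, not taken from the paper).
FREQ_GHZ = [1.0, 1.5, 2.0]            # frequency at each V/f level
VOLT = [0.8, 1.0, 1.2]                # supply voltage at each V/f level
ACTIONS = [(cores, vf) for cores in (1, 2, 4) for vf in range(3)]
CORE_SCALING = {"A": 1.0, "B": 0.3}   # how well each phase uses extra cores
ACTIVITY = {"A": 1.0, "B": 0.5}       # relative switching activity per phase
STRESS_LIMIT = 4.0                    # stand-in for the reliability constraint


def performance(phase, action):
    """Toy throughput model: frequency times a phase-dependent core speedup."""
    cores, vf = action
    return FREQ_GHZ[vf] * cores ** CORE_SCALING[phase]


def stress(phase, action):
    """Toy reliability proxy that grows with active cores, V^2 and f."""
    cores, vf = action
    return cores * VOLT[vf] ** 2 * FREQ_GHZ[vf] * ACTIVITY[phase]


def reward(phase, action):
    """Performance when the constraint holds, a fixed penalty otherwise."""
    if stress(phase, action) <= STRESS_LIMIT:
        return performance(phase, action)
    return -1.0


class PhaseQLearner:
    """Tabular Q-learning with one row of Q-values per detected phase."""

    def __init__(self, actions, alpha=0.5, gamma=0.5, epsilon=0.2, seed=0):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: {a: 0.0 for a in self.actions})

    def best(self, phase):
        """Greedy configuration for a phase under the current Q-values."""
        row = self.q[phase]
        return max(self.actions, key=row.__getitem__)

    def choose(self, phase):
        """Epsilon-greedy selection: explore occasionally, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.best(phase)

    def update(self, phase, action, r, next_phase):
        """Standard one-step Q-learning update."""
        target = r + self.gamma * max(self.q[next_phase].values())
        self.q[phase][action] += self.alpha * (target - self.q[phase][action])


def train(steps=10000, seed=0):
    """Run the learner on an alternating A/B phase sequence (an assumption)."""
    learner = PhaseQLearner(ACTIONS, seed=seed)
    phases = ["A", "B"]
    for t in range(steps):
        phase, next_phase = phases[t % 2], phases[(t + 1) % 2]
        action = learner.choose(phase)
        learner.update(phase, action, reward(phase, action), next_phase)
    return learner


if __name__ == "__main__":
    learner = train()
    for phase in ("A", "B"):
        print(phase, learner.best(phase))
```

In this toy model the two phases end up with different preferred configurations, which is the situation a phase-indexed table is meant to handle. Note that the sketch learns the constraint by occasionally violating it and receiving a penalty; a real reliability manager would likely need a safer exploration scheme.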

References

[1] H. Amrouch, et al. Reliability-aware design to suppress aging. In Proceedings of the 53rd DAC, page 12. ACM, 2016.
[2] C. Bienia, et al. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81. ACM, 2008.
[3] R. Cochran and S. Reda. Consistent runtime thermal prediction and control through workload phase detection. In Proceedings of the 47th DAC, pages 62–67. ACM, 2010.
[4] A. Das, et al. Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems. In 51st DAC, pages 1–6. IEEE, 2014.
[5] T. Frank, et al. Electromigration behavior of 3D-IC TSV interconnects. In ECTC, 2012 IEEE 62nd, pages 326–330. IEEE, 2012.
[6] Y. Ge and Q. Qiu. Dynamic thermal management for multimedia applications using machine learning. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 95–100. IEEE, 2011.
[7] M. S. Gupta, et al. Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In 2007 Design, Automation & Test in Europe Conference & Exhibition, pages 1–6. IEEE, 2007.
[8] M. B. Healy and S. K. Lim. Power delivery system architecture for many-tier 3D systems. In 2010 Proceedings 60th ECTC, pages 1682–1688. IEEE, 2010.
[9] L. Huang, et al. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. In Proceedings of the DATE, pages 51–56. EDAA, 2009.
[10] C. Isci, et al. Live, runtime phase monitoring and prediction on real systems with application to dynamic power management. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 359–370. IEEE Computer Society, 2006.
[11] T. Kim, et al. Learning-based dynamic reliability management for dark silicon processor considering EM effects. In 2016 DATE, pages 463–468. IEEE, 2016.
[12] T. Kim, et al. Invited: Cross-layer modeling and optimization for electromigration induced reliability. In Proceedings of the 53rd DAC, page 30. ACM, 2016.
[13] S. Li, et al. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Microarchitecture, MICRO-42, pages 469–480. IEEE, 2009.
[14] S. Padmanabha, et al. Trace based phase prediction for tightly-coupled heterogeneous cores. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 445–456. ACM, 2013.
[15] J. S. Pak, et al. PDN impedance modeling and analysis of 3D TSV IC by using proposed P/G TSV array model based on separated P/G TSV and chip-PDN models. IEEE Transactions on Components, Packaging and Manufacturing Technology, 1(2):208–219, 2011.
[16] C. Serafy, et al. Continued frequency scaling in 3D ICs through micro-fluidic cooling. In Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2014 IEEE Intersociety Conference on, pages 79–85. IEEE, 2014.
[17] T. Sherwood, et al. Discovering and exploiting program phases. IEEE Micro, 23(6):84–93, 2003.
[18] B. Shi, et al. Hybrid 3D-IC cooling system using micro-fluidic cooling and thermal TSVs. In 2012 IEEE Computer Society Annual Symposium on VLSI, pages 33–38. IEEE, 2012.
[19] R. Ubal, et al. Multi2Sim: a simulation framework for CPU-GPU computing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pages 335–344. ACM, 2012.
[20] S. C. Woo, et al. The SPLASH-2 programs: Characterization and methodological considerations. In ACM SIGARCH Computer Architecture News, volume 23, pages 24–36. ACM, 1995.


      Published In

      DAC '17: Proceedings of the 54th Annual Design Automation Conference 2017
      June 2017
      533 pages
      ISBN:9781450349277
      DOI:10.1145/3061639
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      DAC '17
      Acceptance Rates

      Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

      Cited By

      • (2023) Learning-Oriented Reliability Improvement of Computing Systems From Transistor to Application Level. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1-10. DOI: 10.23919/DATE56975.2023.10137182. Online publication date: Apr-2023.
      • (2023) COP: A Combinational Optimization Power Budgeting Method for Manycore Systems in Dark Silicon. IEEE Transactions on Computers, 72:5, pages 1356-1370. DOI: 10.1109/TC.2022.3211417. Online publication date: 1-May-2023.
      • (2019) Enhanced Phase-Driven Q-Learning-Based DRM for Multicore Processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38:11, pages 2022-2031. DOI: 10.1109/TCAD.2018.2877014. Online publication date: Nov-2019.
      • (2019) A Lifetime Reliability-Constrained Runtime Mapping for Throughput Optimization in Many-Core Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38:9, pages 1771-1784. DOI: 10.1109/TCAD.2018.2855168. Online publication date: Sep-2019.
