
Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis

Published: 06 April 2016
  • Abstract

    To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies.
    A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. Recent work has extended uniprocessor RD techniques (i.e., by introducing concurrent RD and private-stack RD profiling) to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. Unfortunately, there has been no work on using RD analysis to optimize CPU performance or power consumption.
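    To make the idea of a locality profile concrete, the following is a minimal sketch of uniprocessor RD profiling, not the article's multicore profiler. It assumes a fully associative LRU cache and a trace of block identifiers; the function names `reuse_distances`, `rd_profile`, and `predicted_misses` are hypothetical. The point it illustrates is that one profiling pass yields a histogram from which miss counts for many cache capacities can be read off without re-simulating.

```python
from collections import Counter

def reuse_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []       # LRU stack; the most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            idx = stack.index(block)
            # Number of distinct blocks touched since the last access to `block`.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)  # cold (compulsory) miss
        stack.append(block)
    return distances

def rd_profile(trace):
    """Histogram mapping each reuse distance to its number of occurrences."""
    return Counter(reuse_distances(trace))

def predicted_misses(profile, capacity_blocks):
    """Misses in a fully associative LRU cache holding `capacity_blocks` blocks.

    An access hits only if its reuse distance is smaller than the capacity.
    """
    return sum(count for dist, count in profile.items()
               if dist is None or dist >= capacity_blocks)

# Illustrative trace of block identifiers (made up for this example).
trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
profile = rd_profile(trace)
for capacity in (1, 2, 3, 4):
    print(capacity, predicted_misses(profile, capacity))
```

    For this small trace the predicted miss counts are 8, 8, 5, and 4 for capacities of 1 to 4 blocks, all obtained from a single profile. The list-based stack is quadratic in the worst case and is used here only for readability.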
    This article investigates applying multicore RD analysis to identify the most power-efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power-efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01x power-efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power-efficient configuration, or one that is very close to the best. In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.
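    The abstract does not give the analytical models themselves, so the sketch below is only a generic illustration of how miss counts might feed a performance and energy estimate for a single-threaded in-order core with a two-level cache hierarchy. Every name and number in it is an assumption made for illustration: the `CacheConfig` fields, the latencies, the per-access energies, the static power values, and the choice of instructions per joule as the power-efficiency metric are not taken from the article.

```python
from dataclasses import dataclass

@dataclass
class CacheConfig:
    name: str
    l1_misses: int         # assumed to come from a locality profile
    l2_misses: int         # assumed to come from a locality profile
    static_power_w: float  # hypothetical leakage power for this cache sizing

# Hypothetical machine parameters, chosen only for illustration.
INSTRUCTIONS = 1_000_000
BASE_CPI = 1.0           # cycles per instruction with a perfect cache
L2_LATENCY = 10          # stall cycles per L1 miss served by the L2
MEM_LATENCY = 200        # stall cycles per L2 miss served by memory
CLOCK_HZ = 1e9
E_L2_ACCESS_J = 1e-9     # dynamic energy per L2 access
E_MEM_ACCESS_J = 2e-8    # dynamic energy per memory access

def cycles(cfg):
    """In-order core: every miss stalls the pipeline for its full latency."""
    return (INSTRUCTIONS * BASE_CPI
            + cfg.l1_misses * L2_LATENCY
            + cfg.l2_misses * MEM_LATENCY)

def energy_joules(cfg):
    """Static energy over the run plus dynamic energy of the miss traffic."""
    seconds = cycles(cfg) / CLOCK_HZ
    dynamic = cfg.l1_misses * E_L2_ACCESS_J + cfg.l2_misses * E_MEM_ACCESS_J
    return cfg.static_power_w * seconds + dynamic

def efficiency(cfg):
    """Performance per watt, expressed here as instructions per joule."""
    return INSTRUCTIONS / energy_joules(cfg)

# Two made-up design points: a small L2 (more misses, less leakage)
# and a large L2 (fewer misses, more leakage).
configs = [
    CacheConfig("small_l2", l1_misses=50_000, l2_misses=20_000, static_power_w=1.0),
    CacheConfig("large_l2", l1_misses=50_000, l2_misses=5_000, static_power_w=2.0),
]
best = max(configs, key=efficiency)
print(best.name)
```

    Because each design point is scored with a few arithmetic operations rather than a simulation run, a sweep over a large configuration space stays cheap, which is the property the abstract attributes to RD-based exploration.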

    Published In

    ACM Transactions on Computer Systems  Volume 34, Issue 1
    April 2016
    91 pages
    ISSN:0734-2071
    EISSN:1557-7333
    DOI:10.1145/2912578

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 April 2016
    Accepted: 01 November 2015
    Revised: 01 October 2015
    Received: 01 October 2014
    Published in TOCS Volume 34, Issue 1

    Author Tags

    1. Cache performance
    2. chip multiprocessors
    3. design space exploration
    4. reuse distance

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • DARPA
    • NSF
