
Memory Row Reuse Distance and its Role in Optimizing Application Performance

Published: 15 June 2015

Abstract

    The continuously increasing dataset sizes of large-scale applications overwhelm on-chip cache capacities and make the performance of last-level caches (LLCs) increasingly important. That is, in addition to maximizing LLC hit rates, it is becoming equally important to reduce LLC miss latencies. One of the critical factors that influence LLC miss latencies is row-buffer locality (i.e., the fraction of LLC misses that hit in the large buffer attached to a memory bank). While there have been many recent works on optimizing row-buffer performance, to our knowledge, no study quantifies the full potential of row-buffer locality and the impact of maximizing it on application performance.
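As a minimal sketch of the row-buffer locality metric described above, the following Python function computes the fraction of LLC misses that hit in a bank's row buffer. The trace format (a list of `(bank, row)` pairs), the function name, and the assumption of one open row per bank under an open-page policy are illustrative choices, not details taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def row_buffer_hit_rate(misses: Iterable[Tuple[int, int]]) -> float:
    """Return the fraction of LLC misses that find their row already open.

    Assumptions (hypothetical, for illustration only): each miss is a
    (bank, row) pair, and every bank keeps exactly one row open until a
    different row in that bank is requested.
    """
    open_row: Dict[int, int] = {}  # bank -> row currently in its row buffer
    hits = 0
    total = 0
    for bank, row in misses:
        total += 1
        if open_row.get(bank) == row:
            hits += 1  # row-buffer hit: the requested row is already open
        else:
            open_row[bank] = row  # row-buffer miss: activate the new row
    return hits / total if total else 0.0


# Hypothetical stream of LLC misses, already mapped to (bank, row).
trace = [(0, 5), (0, 5), (1, 7), (0, 9), (0, 5), (1, 7)]
print(row_buffer_hit_rate(trace))
```

In this toy trace, only the second access to row 5 in bank 0 and the second access to row 7 in bank 1 find their row open, so two of the six misses are row-buffer hits.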
    Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD). We show that, while intra-core RRDs are relatively small (increasing the chances of row-buffer hits), inter-core RRDs are quite large (increasing the chances of row-buffer misses). Motivated by this, we propose two schemes that measure the maximum potential benefits that could be obtained from minimizing RRDs, to the extent allowed by program dependencies. Specifically, one of our schemes (Scheme-I) targets only intra-core RRDs, whereas the other (Scheme-II) aims at reducing both intra-core and inter-core RRDs. Our experimental evaluations demonstrate that (i) Scheme-I reduces intra-core RRDs but increases inter-core RRDs; (ii) Scheme-II reduces inter-core RRDs significantly while behaving similarly to Scheme-I as far as intra-core RRDs are concerned; (iii) Scheme-I and Scheme-II improve the execution times of our applications by 17% and 21%, respectively, on average; and (iv) both schemes deliver consistently good results under different memory request scheduling policies.
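The abstract does not state the formal definition of RRD, so the sketch below rests on an assumption borrowed from classical reuse distance: the RRD of an access is taken to be the number of distinct other rows touched in the same bank since the previous access to the same row. A reuse is labeled intra-core when both accesses come from the same core and inter-core otherwise. The trace format, names, and this labeling are hypothetical and may differ from the paper's definition.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical trace record: (core, bank, row).
Access = Tuple[int, int, int]


def row_reuse_distances(trace: List[Access]) -> Dict[str, List[int]]:
    """Collect assumed intra-core and inter-core row reuse distances.

    Assumed definition (not confirmed by the source): the distance is the
    number of distinct rows accessed in the same bank between two
    consecutive accesses to the same row.
    """
    history = defaultdict(list)  # bank -> [(core, row), ...] in access order
    result: Dict[str, List[int]] = {"intra": [], "inter": []}
    for core, bank, row in trace:
        past = history[bank]
        # Scan backwards for the most recent access to the same row.
        for i in range(len(past) - 1, -1, -1):
            prev_core, prev_row = past[i]
            if prev_row == row:
                rows_between = {r for _, r in past[i + 1:]}
                kind = "intra" if prev_core == core else "inter"
                result[kind].append(len(rows_between))
                break
        past.append((core, row))
    return result


# Core 0 reuses row 1 immediately; core 1 touches rows 2 and 3 in the
# same bank before returning to row 1.
trace = [(0, 0, 1), (0, 0, 1), (1, 0, 2), (1, 0, 3), (1, 0, 1)]
print(row_reuse_distances(trace))
```

The backward scan is quadratic in the trace length, which keeps the sketch short; a tree-based stack-distance algorithm would be the usual choice for long traces.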

      Published In

      ACM SIGMETRICS Performance Evaluation Review, Volume 43, Issue 1
      June 2015
      468 pages
      ISSN: 0163-5999
      DOI: 10.1145/2796314

      SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
      June 2015
      488 pages
      ISBN: 9781450334860
      DOI: 10.1145/2745844
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. memory scheduling
      2. multicores
      3. row reuse distance
      4. row-buffer locality

      Qualifiers

      • Research-article

      Funding Sources

      • NSF
      • Intel Inc.
