Memory Row Reuse Distance and its Role in Optimizing Application Performance

Authors: Mahmut Kandemir, Mustafa Karakoy

ACM SIGMETRICS Performance Evaluation Review, Volume 43, Issue 1

Pages 137 - 149

https://doi.org/10.1145/2796314.2745867

Published: 15 June 2015

Abstract

Continuously increasing dataset sizes of large-scale applications overwhelm on-chip cache capacities and make the performance of last-level caches (LLC) increasingly important. That is, in addition to maximizing LLC hit rates, it is becoming equally important to reduce LLC miss latencies. One of the critical factors that influence LLC miss latencies is row-buffer locality (i.e., the fraction of LLC misses that hit in the large buffer attached to a memory bank). While there has been a plethora of recent work on optimizing row-buffer performance, to our knowledge, there is no study that quantifies the full potential of row-buffer locality and the impact of maximizing it on application performance.
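To make the notion of row-buffer locality concrete, the following is a minimal sketch that computes the fraction of LLC misses that hit in a row buffer. It is not taken from the paper: the function name `row_buffer_hit_rate`, the `(bank, row)` trace format, and the assumptions of an open-row policy with one row buffer per bank and in-order service are all illustrative.

```python
def row_buffer_hit_rate(misses):
    """Fraction of LLC misses served from an already-open row.

    `misses` is an iterable of (bank, row) pairs in arrival order.
    Assumption: open-row policy, one row buffer per bank, and no
    request reordering by the memory controller.
    """
    open_row = {}   # bank -> row currently held in that bank's row buffer
    hits = 0
    total = 0
    for bank, row in misses:
        total += 1
        if open_row.get(bank) == row:
            hits += 1              # row-buffer hit: the row is already open
        else:
            open_row[bank] = row   # row-buffer miss: activate the new row
    return hits / total if total else 0.0


# Hypothetical trace of six LLC misses over two banks.
trace = [(0, 5), (0, 5), (1, 2), (0, 7), (1, 2), (0, 5)]
print(row_buffer_hit_rate(trace))
```

In this hypothetical trace, the second access to row 5 of bank 0 and the second access to row 2 of bank 1 hit in the row buffer, while the final access to row 5 misses because row 7 was opened in between.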

Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD). We show that, while intra-core RRDs are relatively small (increasing the chances of row-buffer hits), inter-core RRDs are quite large (increasing the chances of row-buffer misses). Motivated by this, we propose two schemes that measure the maximum potential benefits that could be obtained from minimizing RRDs, to the extent allowed by program dependencies. Specifically, one of our schemes (Scheme-I) targets only intra-core RRDs, whereas the other one (Scheme-II) aims at reducing both intra-core and inter-core RRDs. Our experimental evaluations demonstrate that (i) Scheme-I reduces intra-core RRDs but increases inter-core RRDs; (ii) Scheme-II reduces inter-core RRDs significantly while behaving similarly to Scheme-I as far as intra-core RRDs are concerned; (iii) Scheme-I and Scheme-II improve execution times of our applications by 17% and 21%, respectively, on average; and (iv) both our schemes deliver consistently good results under different memory request scheduling policies.
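The abstract does not spell out how RRD is computed, so the sketch below rests on an assumed definition modeled on classic reuse distance: the RRD of an access is the number of distinct other rows of the same bank touched since the previous access to that row. A reuse counts as intra-core when the previous access came from the same core and as inter-core otherwise. The function name `row_reuse_distances` and the `(core, bank, row)` trace format are hypothetical.

```python
from collections import defaultdict


def row_reuse_distances(trace):
    """Split row reuse distances into intra-core and inter-core lists.

    `trace` is a list of (core, bank, row) tuples in access order.
    Assumed definition (not confirmed by the abstract): the distance of a
    reuse is the number of distinct other rows of the same bank accessed
    since the previous access to this row.
    """
    per_bank = defaultdict(list)   # bank -> [(core, row), ...] seen so far
    intra, inter = [], []
    for core, bank, row in trace:
        history = per_bank[bank]
        # Scan backwards for the most recent access to the same row.
        for i in range(len(history) - 1, -1, -1):
            prev_core, prev_row = history[i]
            if prev_row == row:
                between = {r for _, r in history[i + 1:]}
                (intra if prev_core == core else inter).append(len(between))
                break
        history.append((core, row))
    return intra, inter


# Hypothetical trace: two cores sharing bank 0.
trace = [(0, 0, 1), (0, 0, 1), (1, 0, 2), (0, 0, 3), (1, 0, 1), (1, 0, 2)]
intra, inter = row_reuse_distances(trace)
print(intra, inter)
```

The backward scan is quadratic in the trace length, which keeps the sketch short; a tree-based stack-distance algorithm would be the usual choice for long traces.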

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

ACM SIGMETRICS Performance Evaluation Review, Volume 43, Issue 1
June 2015, 468 pages
ISSN: 0163-5999
DOI: 10.1145/2796314
Editors: Derek Eager (University of Saskatchewan), Carey Williamson (University of Calgary)

SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
June 2015, 488 pages
ISBN: 9781450334860
DOI: 10.1145/2745844
General Chairs: Bill Lin (University of California, San Diego), Jun (Jim) Xu (Georgia Tech)
Program Chairs: Sudipta Sengupta (Microsoft Research), Devavrat Shah (Massachusetts Institute of Technology)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF
Intel Inc.
