research-article

Open access

LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies

Authors:

Srinivas Devadas,

Omer Khan

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4

Article No.: 37, Pages 1 - 28

https://doi.org/10.1145/2983632

Published: 15 November 2016

Abstract

The trend of increasing the number of cores to achieve higher performance has made efficient management of on-chip data more challenging. Moreover, many emerging applications process massive amounts of data with varying degrees of locality. Exploiting locality to reduce on-chip traffic and improve resource utilization is therefore of fundamental importance. Conventional multicore cache management schemes manage either the private cache (L1) or the Last-Level Cache (LLC), while ignoring the other. We propose a holistic locality-aware cache hierarchy management protocol for large-scale multicores. The proposed scheme improves on-chip data access latency and energy consumption by intelligently bypassing cache line replication in the L1 caches, and/or intelligently replicating cache lines in the LLC. The approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at both the L1 cache and the LLC. The decision to bypass the L1 and/or replicate in the LLC is then based on the measured reuse at the fine granularity of cache lines. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional cache coherence protocols. Moreover, the complexity of the protocol is low since no additional coherence states are created. However, the proposed classifier incurs a 5.6KB per-core storage overhead. On a set of parallel benchmarks, the locality-aware protocol reduces average energy consumption by 26% and completion time by 16% when compared to the state-of-the-art Reactive-NUCA multicore cache management scheme.
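The reuse-based decision described in the abstract can be illustrated with a minimal software sketch. This is not the paper's hardware classifier: the class name `LocalityClassifier`, the single per-line counter, the threshold values, and the direction of each comparison are all assumptions made for illustration, since the abstract only states that the bypass and replication decisions follow from measured reuse per cache line.

```python
from dataclasses import dataclass, field

# Hypothetical reuse thresholds: the abstract does not state the values
# (or the exact counter organization) used by the actual protocol.
L1_REUSE_THRESHOLD = 3
LLC_REUSE_THRESHOLD = 3


@dataclass
class LocalityClassifier:
    """Toy per-core reuse tracker, keyed by cache-line address."""

    l1_threshold: int = L1_REUSE_THRESHOLD
    llc_threshold: int = LLC_REUSE_THRESHOLD
    reuse: dict = field(default_factory=dict)

    def record_access(self, line_addr: int) -> None:
        """Count one more access by this core to the given cache line."""
        self.reuse[line_addr] = self.reuse.get(line_addr, 0) + 1

    def decide(self, line_addr: int) -> dict:
        """Return the two independent decisions for one cache line."""
        count = self.reuse.get(line_addr, 0)
        return {
            # Assumed policy: low measured reuse -> do not replicate in L1.
            "bypass_l1": count < self.l1_threshold,
            # Assumed policy: high measured reuse -> replicate in the LLC.
            "replicate_in_llc": count >= self.llc_threshold,
        }


classifier = LocalityClassifier()
classifier.record_access(0x40)
print(classifier.decide(0x40))  # {'bypass_l1': True, 'replicate_in_llc': False}
```

Keeping the two decisions as separate fields mirrors the "and/or" wording of the abstract: with different thresholds, a line may be neither bypassed at the L1 nor replicated in the LLC.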

Index Terms

LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 4

December 2016

648 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3012405

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 November 2016

Accepted: 01 August 2016

Revised: 01 June 2016

Received: 01 December 2015

Published in TACO Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
Semiconductor Research Corporation (SRC)

Cited By

Maurya A, Rafique M, Tonellot T, AlSalem H, Cappello F, Nicolae B, Butt A, Mi N, Chard K (2023). GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 73-85. Online publication date: 7-Aug-2023.
https://dl.acm.org/doi/10.1145/3588195.3592987

Wu Q, Ji Z (2021). A perceptron-based replication scheme for managing the shared last level cache. Microprocessors & Microsystems 85:C. Online publication date: 1-Sep-2021.
https://dl.acm.org/doi/10.1016/j.micpro.2021.104310

Holtryd N, Manivannan M, Stenstrom P, Pericas M (2020). DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 578-589. Online publication date: May-2020.
https://doi.org/10.1109/IPDPS47924.2020.00066

Nicolae B, Moody A, Gonsiorowski E, Mohror K, Cappello F (2019). VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 911-920. Online publication date: May-2019.
https://doi.org/10.1109/IPDPS.2019.00099

Hussain W, Ashraf M, Ferooz F, Butt A, Khan Y (2019). An exposition on the applications of Locality Aware Scheduling algorithms. 2019 International Conference on Innovative Computing (ICIC), 1-6. Online publication date: Nov-2019.
https://doi.org/10.1109/ICIC48496.2019.8966718

Wu Q, Ji Z (2019). A Reuse-Degree Based Locality Classifier for Locality-Aware Data Replication. IEEE Access 7, 182207-182216. Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2959840
