Runtime-assisted cache coherence deactivation in task parallel programs

Authors: Miquel Moretó, Marc Casas

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Article No.: 35, Pages 1 - 12

https://doi.org/10.1109/SC.2018.00038

Published: 26 July 2019

Abstract

With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability.

This paper proposes a hardware/software co-designed approach: the runtime system identifies data that the programming model semantics guarantee does not require coherence, and notifies the microarchitecture. The microarchitecture deactivates coherence for this private data and powers off unused directory capacity. Our proposal reduces directory accesses to just 26% of the baseline system, and supports a 64x smaller directory with only 2.8% performance degradation. By dynamically calibrating the directory size, our proposal saves 86% of dynamic energy consumption in the directory without harming performance.
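
As an illustration only, the runtime-side classification described in the abstract could be sketched as follows. This is not the authors' implementation: the `Region` and `Task` types, the `noncoherent_regions` function, and the safety rule itself (a region is safe while it is either read-only among running tasks or touched by exactly one writing task) are assumptions made for the example. Regions are compared as whole units, so partial overlaps are not handled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Hypothetical address range named in a task's dependence annotations."""
    start: int
    size: int


@dataclass
class Task:
    """Hypothetical task descriptor as a runtime system might keep it."""
    name: str
    reads: set = field(default_factory=set)   # regions declared as inputs
    writes: set = field(default_factory=set)  # regions declared as outputs or in/out


def noncoherent_regions(running):
    """Return the regions that need no coherence while `running` tasks execute.

    Assumed rule: a region is safe if no running task writes it, or if exactly
    one running task writes it and no other running task reads it.
    """
    writers, readers = {}, {}
    for task in running:
        for region in task.writes:
            writers.setdefault(region, set()).add(task.name)
        # A task that both reads and writes a region counts only as its writer.
        for region in task.reads - task.writes:
            readers.setdefault(region, set()).add(task.name)

    safe = set()
    for region in set(writers) | set(readers):
        w = writers.get(region, set())
        r = readers.get(region, set())
        if not w:                       # read-only among the running tasks
            safe.add(region)
        elif len(w) == 1 and not r:     # exclusively owned by a single task
            safe.add(region)
    return safe


if __name__ == "__main__":
    a, b, c = Region(0, 64), Region(64, 64), Region(128, 64)
    t1 = Task("t1", reads={a}, writes={b})
    t2 = Task("t2", reads={a}, writes={c})
    # In a real system the result would be communicated to the hardware;
    # here it is only printed.
    for region in sorted(noncoherent_regions([t1, t2]), key=lambda r: r.start):
        print(region)
```

In this sketch the decision relies only on what tasks declare, which mirrors the abstract's point that the guarantee comes from programming model semantics rather than from hardware observation of accesses.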

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

November 2018

932 pages

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press

Qualifiers

Research-article

Conference

SC18

Sponsor:

SIGHPC

SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 16, 2018

Dallas, Texas

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
