DOI: 10.1109/SC.2018.00038
research-article

Runtime-assisted cache coherence deactivation in task parallel programs

Published: 26 July 2019

Abstract

With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability.
This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model semantics not to require coherence and notifies the microarchitecture. The microarchitecture deactivates coherence for this private data and powers off unused directory capacity. Our proposal reduces directory accesses to just 26% of the baseline system, and supports a 64x smaller directory with only 2.8% performance degradation. By dynamically calibrating the directory size, our proposal saves 86% of dynamic energy consumption in the directory without harming performance.
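
The following toy model illustrates the general idea of a runtime telling the hardware which data needs no coherence tracking while a task runs. It is only a sketch under stated assumptions, not the mechanism evaluated in the paper: the `ToyDirectory` class, its `deactivate` and `reactivate` hooks, the `run_task` helper, and the 64-byte block size are all hypothetical names and values chosen for illustration, and regions are assumed to be aligned to block boundaries.

```python
from dataclasses import dataclass, field

BLOCK = 64  # assumed cache-block size in bytes (illustrative value)


@dataclass(frozen=True)
class Region:
    """A contiguous address range declared as a task dependence."""
    start: int  # first byte address (assumed block-aligned)
    size: int   # length in bytes

    def blocks(self):
        first = self.start // BLOCK
        last = (self.start + self.size - 1) // BLOCK
        return range(first, last + 1)


@dataclass
class ToyDirectory:
    """Counts directory accesses and skips blocks flagged as non-coherent."""
    non_coherent: set = field(default_factory=set)
    accesses: int = 0

    def deactivate(self, region):  # hypothetical runtime-to-hardware hook
        self.non_coherent.update(region.blocks())

    def reactivate(self, region):  # hypothetical runtime-to-hardware hook
        self.non_coherent.difference_update(region.blocks())

    def touch(self, addr):
        # Only blocks that still need coherence cost a directory access.
        if addr // BLOCK not in self.non_coherent:
            self.accesses += 1


def run_task(directory, deps, addresses):
    """Deactivate coherence for the task's declared regions while it runs."""
    for region in deps:
        directory.deactivate(region)
    try:
        for addr in addresses:
            directory.touch(addr)
    finally:
        for region in deps:
            directory.reactivate(region)


if __name__ == "__main__":
    trace = [0, 64, 128, 192, 4096]  # made-up memory accesses of one task
    with_hint, baseline = ToyDirectory(), ToyDirectory()
    run_task(with_hint, [Region(0, 256)], trace)
    run_task(baseline, [], trace)
    print(with_hint.accesses, baseline.accesses)
```

In this toy trace, the four accesses that fall inside the declared region bypass the directory, so only the access outside it is counted. A real design would also have to handle regions that share a cache block with other data, which the sketch ignores.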

Cited By

  • (2023) Fine-grain data classification to filter token coherence traffic. Journal of Parallel and Distributed Computing 171:C (40-53). DOI: 10.1016/j.jpdc.2022.09.004. Online publication date: 1-Jan-2023

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018
932 pages

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. cache memory
  2. memory architecture
  3. parallel programming
  4. runtime environment

Qualifiers

  • Research-article

Conference

SC18
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 03 Feb 2025
