research-article

The case for a scalable coherence protocol for complex on-chip cache hierarchies in many core systems

Authors:

Lucia G. Menezo,

Valentin Puente,

Jose Angel Gregorio

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Pages 279 - 288

Published: 07 October 2013

Abstract

This paper introduces a new coherence protocol that addresses the challenges of complex multilevel cache hierarchies in future many-core systems. To keep coherence protocol complexity bounded, inclusiveness is required to track coherence information across levels in this type of system, but this might introduce unsustainable costs for directory structures. Cost-reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and the size of the private caches are large. The coherence protocol presented in this work, denoted MOSAIC, introduces a new approach to tackle this problem. In energy terms, the protocol scales like a conventional directory coherence protocol, but it relaxes the inclusiveness of the sharing information. This allows the performance implications of reducing directory size and associativity to be overcome. Contrary to the common belief that inclusiveness is inescapable when attempting to keep complexity constrained, MOSAIC is even simpler than a conventional directory. The results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory.
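The cost of inclusiveness mentioned in the abstract can be illustrated with a toy model. The sketch below is not MOSAIC and is not taken from the paper; it only models a conventional inclusive sparse directory, under the assumption of LRU replacement, to show why shrinking such a directory hurts: whenever a directory entry is evicted, every private copy it tracked has to be invalidated. The class name `InclusiveDirectory`, the helper `run`, and the access trace are all hypothetical, and private caches themselves are not modeled.

```python
from collections import OrderedDict


class InclusiveDirectory:
    """Toy inclusive sparse directory with LRU replacement (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # block address -> set of sharer core ids
        self.forced_invalidations = 0  # private copies lost to directory evictions

    def access(self, core, block):
        if block in self.entries:
            self.entries.move_to_end(block)  # refresh the LRU position
        else:
            if len(self.entries) >= self.capacity:
                # Inclusiveness: an untracked block may not remain cached, so
                # every private copy of the evicted entry must be invalidated.
                _, sharers = self.entries.popitem(last=False)
                self.forced_invalidations += len(sharers)
            self.entries[block] = set()
        self.entries[block].add(core)


def run(capacity, trace):
    """Replay (core, block) accesses and count forced invalidations."""
    directory = InclusiveDirectory(capacity)
    for core, block in trace:
        directory.access(core, block)
    return directory.forced_invalidations


# Four cores each touch four private blocks: 16 distinct blocks in total.
trace = [(core, core * 100 + i) for i in range(4) for core in range(4)]
small = run(4, trace)    # directory much smaller than the working set
large = run(16, trace)   # directory covers the whole working set
print(small, large)      # prints "12 0"
```

In this toy trace the undersized directory forces 12 invalidations of blocks that no other core ever requested, while the full-size directory forces none. This is the kind of artificial inefficiency that the abstract attributes to directory size and associativity reduction under inclusiveness.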

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Published In

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

October 2013

422 pages

ISBN:9781479910212

Conference Chair: Christian Fensch (University of Edinburgh, UK)

General Chair: Michael O'Boyle (University of Edinburgh, UK)

Program Chairs: André Seznec (INRIA Rennes, France), François Bodin (IRISA/CAPS Entreprise, France)

Sponsors

IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Press

Acceptance Rates

PACT '13 Paper Acceptance Rate 36 of 208 submissions, 17%

Overall Acceptance Rate 121 of 471 submissions, 26%
