research-article

A composite and scalable cache coherence protocol for large scale CMPs

Authors:

Jun Yang

ICS '11: Proceedings of the international conference on Supercomputing

Pages 285 - 294

https://doi.org/10.1145/1995896.1995941

Published: 31 May 2011

Abstract

The number of on-chip cores in modern chip multiprocessors (CMPs) is growing fast with technology scaling. However, efficiently supporting cache coherence for large-scale CMPs remains a big challenge. Conventional snoopy and directory coherence protocols do not scale smoothly to many-core or thousand-core processors. Snoopy protocols incur a large power overhead because of the enormous amount of cache tag probing triggered by broadcast. Directory protocols incur a performance penalty due to indirection, and a large storage overhead for the directories themselves.
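To make the directory storage overhead concrete, here is a back-of-the-envelope sketch that compares sharer-tracking bits per cache line against the line's data bits. The 64-byte line size, the 16-core grouping, and the function name `sharer_overhead` are illustrative assumptions rather than figures from the paper; only the thousand-core scale comes from the abstract (rounded here to 1024). State bits and tags are ignored.

```python
def sharer_overhead(num_cores: int, line_bytes: int = 64, cores_per_bit: int = 1) -> float:
    """Ratio of sharer-vector bits to data bits for one cache line.

    cores_per_bit=1 models a full-map bit vector (one presence bit per core);
    larger values model a coarse vector where one bit covers a group of cores.
    All parameters are assumptions chosen for illustration.
    """
    sharer_bits = -(-num_cores // cores_per_bit)  # ceiling division
    data_bits = line_bytes * 8
    return sharer_bits / data_bits


full_map = sharer_overhead(1024)                  # 1024 bits per 512-bit line
coarse = sharer_overhead(1024, cores_per_bit=16)  # 64 bits per 512-bit line

print(f"full-map overhead: {full_map:.1%}")
print(f"coarse overhead:   {coarse:.1%}")
```

Under these assumptions a full-map vector needs twice as many bits as the data it tracks, while grouping 16 cores per bit reduces the vector to one eighth of the line size, at the cost of less precise sharer information.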

This paper addresses the efficiency problem of supporting cache coherence for large-scale CMPs. By leveraging emerging optical on-chip interconnect (OP-I) technology, which provides high bandwidth density, low propagation delay, and natural support for multicast/broadcast in a hierarchical network organization, we propose a composite cache coherence (C³) protocol that benefits from direct cache-to-cache accesses, as in a snoopy protocol, and a small amount of cache probing, as in a directory protocol. To complete coherence transactions quickly, C³ organizes accesses in a three-tier hierarchy that combines a mix of designs including local broadcast prediction, filtering, and a coarse-grained directory. Compared to a directory-based protocol [18], our evaluations on a thousand-core CMP show that C³ improves performance by 21%, reduces the network latency of coherence messages by 41%, and saves 5.5% of network energy consumption on average for PARSEC applications.
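The idea of a coarse-grained directory can be illustrated with a generic sketch in which each presence bit covers a cluster of cores rather than a single core. This is not the C³ design itself, whose details are not given in the abstract; the class name `CoarseDirectoryEntry`, its methods, and the cluster size of 16 are hypothetical and chosen only for illustration.

```python
class CoarseDirectoryEntry:
    """Sharer tracking for one cache line with one presence bit per cluster.

    Hypothetical layout for illustration; not taken from the paper.
    """

    def __init__(self, num_cores: int, cluster_size: int) -> None:
        self.cluster_size = cluster_size
        self.num_clusters = -(-num_cores // cluster_size)  # ceiling division
        self.cluster_bits = 0  # bit c set => some core in cluster c may hold the line

    def add_sharer(self, core_id: int) -> None:
        # Record only the cluster; the exact core is not remembered.
        self.cluster_bits |= 1 << (core_id // self.cluster_size)

    def clusters_to_probe(self) -> list[int]:
        # Clusters that must receive a multicast probe or invalidation.
        return [c for c in range(self.num_clusters) if (self.cluster_bits >> c) & 1]

    def probe_count(self) -> int:
        # Every cache in a marked cluster is probed, sharer or not.
        return len(self.clusters_to_probe()) * self.cluster_size


entry = CoarseDirectoryEntry(num_cores=1024, cluster_size=16)
for core in (5, 17, 1000):
    entry.add_sharer(core)

print(entry.clusters_to_probe())
print(entry.probe_count())
```

In this sketch three sharers in three different clusters lead to 48 cache probes instead of 1024 under a full broadcast, which shows the trade-off between directory size and probing precision.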

References

[1] M. E. Acacio, et al., "Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture," In SC, 2002.
[2] M. E. Acacio, et al., "The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors," In PACT, pp. 155--164, 2002.
[3] A. Agarwal, et al., "An Evaluation of Directory Schemes for Cache Coherence," In ISCA, pp. 353--362, 1988.
[4] N. Agarwal, et al., "In-Network Coherence Filtering: Snoopy Coherence without Broadcasts," In MICRO, 2009.
[5] N. Agarwal, et al., "In-Network Snoop Ordering: Snoopy Coherence on Unordered Interconnects," In HPCA, 2009.
[6] J. Balfour and W. J. Dally, "Design tradeoffs for tiled cmp on-chip networks," In ICS, pp. 187--198, 2006.
[7] C. Batten, et al., "Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics," In High Performance Interconnects, pp. 21--30, 2008.
[8] S. Beamer, et al., "Re-Architecting DRAM Memory Systems with Monolithically Integrated Silicon Photonics," In ISCA, pp. 117--128, 2010.
[9] C. Bienia, et al., "The parsec benchmark suite: Characterization and architectural implications," In PACT, pp. 72--81, 2008.
[10] B. Black, et al., "Die stacking (3d) microarchitecture," In MICRO, pp. 469--479, 2006.
[11] S. Borkar, "Thousand core chips - a technology perspective," In DAC, pp. 746--749, 2007.
[12] CACTI, http://www.hpl.hp.com/research/cacti/
[13] L. M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems," In IEEE Trans. on Computers, pp. 1112--1118, 1978.
[14] M. J. Cianchetti, et al., "Phastlane: A Rapid Transit Optical Routing Network," In ISCA, pp. 441--450, 2009.
[15] S. Chaudhry, et al., "Rock: A High-Performance Sparc CMT Processor," In IEEE Micro, 29(2):6--16, 2009.
[16] W. J. Dally and B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004.
[17] N. Eisley, et al., "In-network cache coherence," In MICRO, pp. 321--332, 2006.
[18] A. Gupta, et al., "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," In ICPP, pp. 312--321, 1990.
[19] J.-H. Ha and T. M. Pinkston, "A Hybrid Cache Coherence Protocol for a Decoupled Multi-Channel Optical Network: SPEED DMON," In ICPP, pp. 164--171, 1996.
[20] L. Hammond, et al., "A single-chip multiprocessor," In IEEE Computer, 30(9):79--85, 1997.
[21] Semiconductor Industry Association, "International Technology Roadmap for Semiconductors," http://www.itrs.net/Links/2009ITRS/Home2009.htm, 2009.
[22] A. Jaleel, et al., "High performance cache replacement using re-reference interval prediction (RRIP)," In ISCA, pp. 60--71, 2010.
[23] N. Enright-Jerger, et al., "Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support," In ISCA, 2008.
[24] N. Enright-Jerger, et al., "Virtual Tree Coherence: Leveraging Regions and In-Network Multicast Trees for Scalable Cache Coherence," In MICRO, 2008.
[25] A. Joshi, et al., "Silicon-Photonic Clos Networks for Global On-Chip Communication," In NOCS, 2009.
[26] J. H. Kelm, et al., "WAYPOINT: scaling coherence to 1000-core architectures," In PACT, pp. 99--109, 2010.
[27] T. Kgil, et al., "Picoserver: Using 3d stacking technology to enable a compact energy efficient chip multiprocessor," In ASPLOS, pp. 117--128, 2006.
[28] J. Kim, et al., "Flattened butterfly topology for on-chip networks," In MICRO, 2007.
[29] N. Kirman, et al., "Leveraging optical technology in future bus-based chip multiprocessors," In MICRO, pp. 492--503, 2006.
[30] N. Kirman and J. Martinez, "An efficient all-optical on-chip interconnect based on oblivious routing," In ASPLOS, 2010.
[31] D. Lenoski, et al., "Design of Scalable Shared-Memory Multiprocessors: The DASH Approach," In COMPCON, pp. 62--67, 1990.
[32] Z. Li, et al., "Spectrum: A Hybrid Nanophotonic-Electric On-Chip Network," In DAC, pp. 575--580, 2009.
[33] M. M. K. Martin, et al., "Bandwidth Adaptive Routing," In HPCA, 2002.
[34] M. M. K. Martin, et al., "Token Coherence: Decoupling Performance and Correctness," In ISCA, 2003.
[35] M. Marty, et al., "Improving multiple-cmp systems using token coherence," In HPCA, 2005.
[36] D. Miller, "Rationale and Challenges for Optical Interconnects to Electronic Chips," In Proceedings of the IEEE, 88(6):728--749, 2000.
[37] G. Kurian, et al., "ATAC: A 1000-core cache-coherent processor with on-chip optical network," In PACT, pp. 447--488, 2010.
[38] A. Moshovos, et al., "JETTY: Filtering snoops for reduced energy consumption in SMP servers," In HPCA, pp. 85--96, 2001.
[39] "Noxim, An Open Network-on-Chip Simulator," http://noxim.sourceforge.net
[40] nVidia, "Quadro fx 3700m," http://www.nvidia.com/object/product_quadro_fx_3700_m_us.html.
[41] K. Olukotun, et al., "The case for a single-chip multiprocessor," In ASPLOS, pp. 2--11, 1996.
[42] Y. Pan, et al., "Firefly: Illuminating Future Network-on-Chip with Nanophotonics," In Int. Symp. on Computer Architecture, ISCA'09, pp. 429--440, 2009.
[43] Y. Pan In Int. Symp. on High-Performance Computer Architecture (HPCA), 2010.
[44] PTLsim. http://www.ptlsim.org/
[45] PTM interconnect model. http://www.eas.asu.edu/~ptm/interconnect.html
[46] K. Strauss, et al., "Uncorq: Unconstrained snoop request delivery in embedded-ring multiprocessors," In Int. Symp. on Microarchitecture, pp. 327--342, 2007.
[47] A. N. Udipi, et al., "Towards Scalable, Energy-Efficient Bus-Based On-Chip Networks," In Int. Symp. on High-Performance Computer Architecture (HPCA), pp. 1--12, 2010.
[48] S. Vangal, et al., "An 80-tile 1.28tflops network-on-chip in 65nm cmos," In IEEE Int. Solid-State Circuits Conf., pp. 98--590, 2007.
[49] D. Vantrease, et al., "Corona: System implications of emerging nanophotonic technology," In Int. Symp. on Computer Architecture, pp. 153--164, 2008.
[50] D. Vantrease, et al., "Atomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols," In Int. Symp. on High-Performance Computer Architecture (HPCA), 2011.
[51] J. Xue, et al., "An Intra-Chip Free-Space Optical Interconnect," In Int. Symp. on Computer Architecture, ISCA 2010.
[52] J. Zebchuk, et al., "A Tagless Coherence Directory," In Int. Symp. on Microarchitecture, MICRO, pp. 423--434, 2009.

Cited By

Nisa U, Bashir J (2024). Towards Efficient On-Chip Communication: A Survey on Silicon Nanophotonics and Optical Networks-on-Chip. Journal of Systems Architecture, 152 (103171). doi:10.1016/j.sysarc.2024.103171. Online publication date: Jul-2024
https://doi.org/10.1016/j.sysarc.2024.103171
Li C, Jiang F, Chen S, Zhang J, Liu Y, Fu Y, Xu J, Mitra T, Young E, Xiong J (2022). Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design (1-9). doi:10.1145/3508352.3549338. Online publication date: 30-Oct-2022
https://dl.acm.org/doi/10.1145/3508352.3549338
Wang Z, Wang Z, Xu J, Chang Y, Feng J, Chen X, Chen S, Zhang J (2020). CAMON: Low-Cost Silicon Photonic Chiplet for Manycore Processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39:9 (1820-1833). doi:10.1109/TCAD.2019.2926495. Online publication date: Sep-2020
https://doi.org/10.1109/TCAD.2019.2926495

Index Terms

A composite and scalable cache coherence protocol for large scale CMPs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Recommendations

Snooping and Ordering Ring - An Efficient Cache Coherence Protocol for Ring Connected CMP
ICPADS '09: Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems

Ring is a promising on-chip interconnection for CMP. It is more scalable than bus and much simpler than packet-switched networks. The ordering property of ring can be used to optimize cache coherence protocol design. Existing ring protocols, such as the ...
Replacement techniques for dynamic NUCA cache designs on CMPs

The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-...
A Direct Coherence Protocol for Many-Core Chip Multiprocessors

Future many-core CMP designs that will integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make impractical the use of a bus or a crossbar as the on-chip interconnection network, and tiled CMPs organized ...

Information & Contributors

Information

Published In

ICS '11: Proceedings of the international conference on Supercomputing

May 2011

398 pages

ISBN:9781450301022

DOI:10.1145/1995896

General Chair: David K. Lowenthal (University of Arizona)
Program Chairs: Bronis R. de Supinski (Lawrence Livermore National Laboratory), Sally A. McKee (Chalmers University of Technology)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS '11

Sponsor:

SIGARCH

ICS '11: International Conference on Supercomputing

May 31 - June 4, 2011

Tucson, Arizona, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 31
Total Downloads: 609
Downloads (Last 12 months): 19
Downloads (Last 6 weeks): 0

Reflects downloads up to 06 Feb 2025

Citations

Cited By

Bashir J, Peter E, Sarangi S (2019). A Survey of On-Chip Optical Interconnects. ACM Computing Surveys, 51:6 (1-34). doi:10.1145/3267934. Online publication date: 28-Jan-2019
https://dl.acm.org/doi/10.1145/3267934
Lin J, Lu J, Cai J, Shrivastava A (2019). Efficient Heap Data Management on Software Managed Manycore Architectures. 2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID) (269-274). doi:10.1109/VLSID.2019.00065. Online publication date: Jan-2019
https://doi.org/10.1109/VLSID.2019.00065
Xu Y, Yang J, Melhem R (2018). A Process-Variation-Tolerant Method for Nanophotonic On-Chip Network. ACM Journal on Emerging Technologies in Computing Systems, 14:2 (1-23). doi:10.1145/3208073. Online publication date: 11-Jul-2018
https://dl.acm.org/doi/10.1145/3208073
Grani P, Bartolini S (2018). Scalable Path-Setup Scheme for All-Optical Dynamic Circuit Switched NoCs in Cache Coherent CMPs. ACM Journal on Emerging Technologies in Computing Systems, 14:1 (1-27). doi:10.1145/3154840. Online publication date: 8-Mar-2018
https://dl.acm.org/doi/10.1145/3154840
Dumas J, Guthmuller E, Fuguet Tortolero C, Pétrot F (2017). A Method for Fast Evaluation of Sharing Set Management Strategies in Cache Coherence Protocols. Architecture of Computing Systems - ARCS 2017 (111-123). doi:10.1007/978-3-319-54999-6_9. Online publication date: 4-Mar-2017
https://doi.org/10.1007/978-3-319-54999-6_9
Cai J, Shrivastava A (2016). Software Coherence Management on Non-coherent Cache Multi-cores. Proceedings of the 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID) (397-402). doi:10.1109/VLSID.2016.70. Online publication date: 4-Jan-2016
https://dl.acm.org/doi/10.1109/VLSID.2016.70
Jian Cai, Shrivastava A (2016). Efficient pointer management of stack data for software managed multicores. 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP) (67-74). doi:10.1109/ASAP.2016.7760774. Online publication date: Jul-2016
https://doi.org/10.1109/ASAP.2016.7760774
