Article

In-Network Cache Coherence

Authors:

Li Shang

MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 321 - 332

https://doi.org/10.1109/MICRO.2006.27

Published: 09 December 2006

Abstract

With the trend towards an increasing number of processor cores in future chip architectures, scalable directory-based protocols for maintaining cache coherence will be needed. However, directory-based protocols face well-known problems in delay and scalability. Most current protocol optimizations targeting these problems maintain a firm abstraction of the interconnection network fabric as a communication medium: protocol optimizations consist of end-to-end messages between requestor, directory and sharer nodes, while network optimizations separately target lowering communication latency for coherence messages. In this paper, we propose an implementation of the cache coherence protocol within the network, embedding directories within each router node that manage and steer requests towards nearby data copies, enabling in-transit optimization of memory access delay. Simulation results across a range of SPLASH-2 benchmarks demonstrate significant performance improvement and good system scalability, with up to 44.5% and 56% savings in average memory access latency for 16-node and 64-node systems, respectively, when compared against the baseline directory cache coherence protocol. Detailed microarchitecture and implementation characterization affirms the low area and delay impact of in-network coherence.
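
The steering idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's protocol: it assumes a 2D mesh with dimension-ordered (X-then-Y) routing, a hypothetical per-router hint table (`hints`) that points at one known holder of a cache line, and hop count as a stand-in for latency. The names `Mesh`, `record_copy`, `baseline_read_hops` and `in_network_read_hops` are invented for illustration. It contrasts a baseline read that always travels to the home directory with a read that is redirected by the first router on its path that knows of a nearby copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]


def manhattan(a: Coord, b: Coord) -> int:
    """Link hops between two mesh routers under minimal routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Routers visited after src under X-then-Y routing (assumed policy)."""
    x, y = src
    path = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


@dataclass
class Mesh:
    # hints[router][addr] = a node that this router believes holds addr
    # (hypothetical structure standing in for a router-embedded directory)
    hints: Dict[Coord, Dict[int, Coord]] = field(default_factory=dict)

    def record_copy(self, addr: int, holder: Coord, home: Coord) -> None:
        """Leave a pointer to holder at every router from holder to home."""
        for router in [holder] + xy_route(holder, home):
            self.hints.setdefault(router, {})[addr] = holder

    def baseline_read_hops(self, requester: Coord, home: Coord,
                           owner: Coord) -> int:
        """End-to-end path: requester -> home directory -> owner -> requester."""
        return (manhattan(requester, home) + manhattan(home, owner)
                + manhattan(owner, requester))

    def in_network_read_hops(self, requester: Coord, home: Coord,
                             addr: int) -> int:
        """Head for home, but divert at the first router holding a hint."""
        path = [requester] + xy_route(requester, home)
        for hops_so_far, router in enumerate(path):
            holder: Optional[Coord] = self.hints.get(router, {}).get(addr)
            if holder is not None:
                return (hops_so_far + manhattan(router, holder)
                        + manhattan(holder, requester))
        # No copy known anywhere on the path: home memory supplies the data.
        return 2 * manhattan(requester, home)


if __name__ == "__main__":
    mesh = Mesh()
    home, owner, requester = (3, 3), (1, 0), (0, 0)
    mesh.record_copy(addr=0x40, holder=owner, home=home)
    print(mesh.baseline_read_hops(requester, home, owner))   # 12
    print(mesh.in_network_read_hops(requester, home, 0x40))  # 2
```

The hop counts printed here come only from this toy placement on a 4x4 mesh and say nothing about the latency savings reported in the abstract; the sketch also ignores writes, invalidations and hint-table capacity.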


Index Terms

In-Network Cache Coherence
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Interconnect
    2. Semiconductor memory
      1. Dynamic memory

Published In

MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

December 2006

493 pages

ISBN:0769527329

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Conference

Micro-39

Sponsor:

SIGMICRO

Micro-39: The 39th Annual IEEE/ACM International Symposium on Microarchitecture

December 9 - 13, 2006

Acceptance Rates

MICRO 39 Paper Acceptance Rate 42 of 174 submissions, 24%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

Fiolhais L, Sousa L (2023). Transient-Execution Attacks: A Computer Architect Perspective. ACM Computing Surveys 56:3 (1-38). Online publication date: 6-Oct-2023
https://dl.acm.org/doi/10.1145/3603619
Yang J, Yao Z, Yang B, Tan X, Wang Z, Zheng Q (2019). Software-Defined Multimedia Streaming System Aided By Variable-Length Interval In-Network Caching. IEEE Transactions on Multimedia 21:2 (494-509). Online publication date: 1-Feb-2019
https://dl.acm.org/doi/10.1109/TMM.2018.2862349
Shi Q, Kurian G, Hijaz F, Devadas S, Khan O (2016). LDAC. ACM Transactions on Architecture and Code Optimization 13:4 (1-28). Online publication date: 15-Nov-2016
https://dl.acm.org/doi/10.1145/2983632
Huang L, Fettweis G, Nebel W (2014). Leveraging on-chip networks for efficient prediction on multicore coherence. Proceedings of the conference on Design, Automation & Test in Europe (1-4). Online publication date: 24-Mar-2014
https://dl.acm.org/doi/10.5555/2616606.2616825
Ong M, Chen M, Taleb T, Wang X, Leung V, Prakash R, Boukerche A, Li C, Dressler F (2014). FGPC. Proceedings of the 17th ACM international conference on Modeling, analysis and simulation of wireless and mobile systems (295-302). Online publication date: 21-Sep-2014
https://dl.acm.org/doi/10.1145/2641798.2641837
Huang L, Wang Z, Xiao N, Wang Y, Dou Q (2014). Integrated Coherence Prediction. ACM Transactions on Design Automation of Electronic Systems 19:3 (1-22). Online publication date: 23-Jun-2014
https://dl.acm.org/doi/10.1145/2611756
Huang L, Wang Z, Xiao N, Brunvard E, Stevens K, Cavallaro J, Zhang T (2012). An optimized multicore cache coherence design for exploiting communication locality. Proceedings of the great lakes symposium on VLSI (59-62). Online publication date: 3-May-2012
https://dl.acm.org/doi/10.1145/2206781.2206797
Demetriades S, Cho S (2012). Predicting Coherence Communication by Tracking Synchronization Points at Run Time. Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (351-362). Online publication date: 1-Dec-2012
https://dl.acm.org/doi/10.1109/MICRO.2012.40
Nitta C, Farrens M, Macdonald K, Akella V, Marculescu R, Kishinevsky M, Ginosar R, Chatha K (2011). Inferring packet dependencies to improve trace based simulation of on-chip networks. Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip (153-160). Online publication date: 1-May-2011
https://dl.acm.org/doi/10.1145/1999946.1999971
Xu Y, Du Y, Zhang Y, Yang J, Lowenthal D, de Supinski B, McKee S (2011). A composite and scalable cache coherence protocol for large scale CMPs. Proceedings of the international conference on Supercomputing (285-294). Online publication date: 31-May-2011
https://dl.acm.org/doi/10.1145/1995896.1995941
