research-article

A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures

Authors: Sri Harsha Gade and Sujay Deb

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 27, Issue 1

Article No.: 2, Pages 1 - 31

https://doi.org/10.1145/3462775

Published: 13 September 2021

Abstract

Cache coherence ensures the correctness of cached data in multi-core processors. Traditional implementations of existing protocols do not scale to many-core architectures: snoopy coherence requires ordered networks that do not scale, while directory coherence is weighed down by high area and energy overheads. In this work, we propose Wireless-enabled Share-aware Hybrid (WiSH) to provide scalable coherence in many-core processors. WiSH implements a novel Snoopy over Directory protocol using on-chip wireless links and a hierarchical, clustered Network-on-Chip to achieve low-overhead, highly efficient coherence. A local directory protocol maintains coherence within a cluster of cores, while coherence among clusters is achieved through a global snoopy protocol. The ordered network for global snooping is provided by low-latency, low-energy broadcast wireless links. Overheads are further reduced through share-aware cache segmentation, which eliminates coherence for private blocks. Evaluations show that WiSH reduces traffic by […] and runtime by […], while requiring […] smaller storage and […] lower energy than existing hierarchical and hybrid coherence protocols. Owing to its modularity, WiSH provides highly efficient and scalable coherence for many-core processors.
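The two-level idea in the abstract (a directory inside each cluster, broadcast snooping among clusters, and no coherence tracking for private blocks) can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the WiSH protocol itself: the class names, the `read`/`write` interface, and the rule that every write to a shared block triggers one global broadcast are all hypothetical, and cache states, the wireless channel, and message ordering are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of cores with a local directory (hypothetical toy model)."""
    cid: int
    # Local directory: block address -> set of core ids in this cluster with a copy.
    directory: dict = field(default_factory=dict)

    def holds(self, addr):
        return bool(self.directory.get(addr))

    def add_sharer(self, addr, core):
        self.directory.setdefault(addr, set()).add(core)

    def invalidate(self, addr):
        """Drop every local copy of addr; return how many copies were removed."""
        return len(self.directory.pop(addr, set()))


class ToyHybridCoherence:
    """Directory within a cluster, broadcast snooping among clusters (assumed model)."""

    def __init__(self, n_clusters):
        self.clusters = [Cluster(i) for i in range(n_clusters)]
        self.private_blocks = set()  # blocks assumed private: no coherence tracking
        self.broadcasts = 0          # number of global snoop broadcasts issued

    def mark_private(self, addr):
        self.private_blocks.add(addr)

    def read(self, cid, core, addr):
        """Return where the read was resolved: 'private', 'local', or 'global'."""
        if addr in self.private_blocks:
            return "private"         # share-aware shortcut: skip coherence entirely
        home = self.clusters[cid]
        if home.holds(addr):
            home.add_sharer(addr, core)
            return "local"           # resolved by the cluster's own directory
        self.broadcasts += 1         # miss in the cluster: snoop all other clusters
        home.add_sharer(addr, core)
        return "global"

    def write(self, cid, core, addr):
        """Invalidate all other copies of addr; return how many were invalidated."""
        if addr in self.private_blocks:
            return 0
        home = self.clusters[cid]
        # Copies inside the writer's cluster are found through the local directory.
        invalidated = len(home.directory.get(addr, set()) - {core})
        # Copies in other clusters are reached with a single global broadcast.
        self.broadcasts += 1
        for cluster in self.clusters:
            if cluster.cid != cid:
                invalidated += cluster.invalidate(addr)
        home.directory[addr] = {core}
        return invalidated
```

In this sketch, a read that hits in the requesting cluster's directory never leaves the cluster, and accesses to blocks marked private generate no coherence traffic at all. Those are the two effects the abstract attributes to the hierarchical design and to share-aware segmentation.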

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Interconnection architectures
      2. Multicore architectures

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 27, Issue 1

January 2022

230 pages

ISSN: 1084-4309

EISSN: 1557-7309

DOI: 10.1145/3483335

Editor: X. Sharon Hu, University of Notre Dame, USA

Copyright © 2021 Association for Computing Machinery.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 13 September 2021

Accepted: 01 April 2021

Revised: 01 March 2021

Received: 01 May 2020

Qualifiers

Research-article
Refereed
