research-article

A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures

Published: 13 September 2021
    Abstract

    Cache coherence ensures the correctness of cached data in multi-core processors. Traditional implementations of existing protocols do not scale to many-core architectures: snoopy coherence requires ordered networks that scale poorly, while directory coherence is weighed down by high area and energy overheads. In this work, we propose Wireless-enabled Share-aware Hybrid (WiSH) to provide scalable coherence in many-core processors. WiSH implements a novel Snoopy-over-Directory protocol using on-chip wireless links and a hierarchical, clustered Network-on-Chip to achieve low-overhead and highly efficient coherence. A local directory protocol maintains coherence within a cluster of cores, while coherence among clusters is achieved through a global snoopy protocol. The ordered network required for global snooping is provided by low-latency, low-energy broadcast wireless links. Overheads are further reduced through share-aware cache segmentation, which eliminates coherence for private blocks. Evaluations show that WiSH reduces traffic and runtime while requiring smaller storage and lower energy than existing hierarchical and hybrid coherence protocols. Owing to its modularity, WiSH provides highly efficient and scalable coherence for many-core processors.
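
    The two-level structure described in the abstract can be illustrated with a toy model. The sketch below is not the WiSH protocol itself: the class names (`Cluster`, `ToyHybrid`), the per-cluster `exclusive` set, the pre-classified `private_blocks` set, and the simple invalidate-on-write policy are all assumptions made for illustration. It only shows the routing idea: accesses to private blocks skip coherence, accesses that the local directory can satisfy stay inside the cluster, and everything else triggers one global broadcast that every other cluster observes.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Local directory (assumed layout): block address -> core ids holding a copy.
    sharers: dict = field(default_factory=dict)
    # Blocks this cluster may write without consulting other clusters.
    exclusive: set = field(default_factory=set)


class ToyHybrid:
    """Toy model: directory inside a cluster, broadcast snooping between clusters."""

    def __init__(self, n_clusters, private_blocks=()):
        self.clusters = [Cluster() for _ in range(n_clusters)]
        # Assumption: private/shared classification is known in advance.
        self.private_blocks = set(private_blocks)
        self.broadcasts = 0  # number of global snoop broadcasts issued

    def _broadcast(self, src, addr, invalidate):
        # One broadcast reaches every other cluster. Handling it in a single
        # loop stands in for the total order a shared broadcast medium gives.
        self.broadcasts += 1
        for i, cluster in enumerate(self.clusters):
            if i == src:
                continue
            cluster.exclusive.discard(addr)
            if invalidate:
                cluster.sharers.pop(addr, None)

    def read(self, cid, core, addr):
        if addr in self.private_blocks:
            return "private"  # no coherence action for private blocks
        cluster = self.clusters[cid]
        if addr in cluster.sharers:
            cluster.sharers[addr].add(core)  # served by the local directory
            return "local"
        self._broadcast(cid, addr, invalidate=False)
        cluster.sharers[addr] = {core}
        return "global"

    def write(self, cid, core, addr):
        if addr in self.private_blocks:
            return "private"
        cluster = self.clusters[cid]
        if addr in cluster.exclusive:
            # Local directory invalidates the other sharers in this cluster.
            cluster.sharers[addr] = {core}
            return "local"
        self._broadcast(cid, addr, invalidate=True)
        cluster.sharers[addr] = {core}
        cluster.exclusive.add(addr)
        return "global"
```

    In this sketch the `broadcasts` counter is a rough proxy for global traffic: marking more blocks as private, or keeping sharing inside one cluster, lowers it, which is the intuition behind combining the two protocols.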

        Published In

        ACM Transactions on Design Automation of Electronic Systems  Volume 27, Issue 1
        January 2022
        230 pages
        ISSN:1084-4309
        EISSN:1557-7309
        DOI:10.1145/3483335

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 13 September 2021
        Accepted: 01 April 2021
        Revised: 01 March 2021
        Received: 01 May 2020
        Published in TODAES Volume 27, Issue 1

        Author Tags

        1. Cache coherence
        2. hybrid protocol
        3. many core processors
        4. mm-wave wireless links

        Qualifiers

        • Research-article
        • Refereed
