Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures

Published: 23 June 2014

Abstract

    Multicore architectures with Networks-on-Chip (NoCs) are widely recognized as the de facto design for efficiently exploiting the continuously increasing transistor density on a chip. A key challenge in designing such an NoC-based multicore processor is maintaining cache coherence efficiently. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols and therefore scale to a large number of cores. However, conventional directory structures add significant indirection delay to cache-to-cache accesses in larger multicore processors.
    In this article we propose a novel hardware coherence technique called integrated coherence prediction (ICP). The approach uses prediction to manage shared data, reducing or eliminating the cache-to-cache delay in coherence accesses. Two features distinguish ICP from previous coherence prediction techniques. First, ICP introduces an integrated prediction scheme that combines two kinds of predictors: an owner predictor, which predicts the data writers and avoids the indirection through the directory, and a data predictor, which predicts the access address and prefetches data directly from remote nodes. Second, ICP uses a request replication method to reduce the negative effect of wrong owner predictions, thus improving overall performance. We present the design and implementation details of the ICP approach. Using detailed full-system simulations, we conclude that ICP provides a cost-effective solution for designing high-performance multicore processors.
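
    The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea, not the authors' design. It assumes a 2D mesh with Manhattan-distance routing, a last-writer table as the owner predictor, and reads "request replication" as sending one copy of the request to the predicted owner and one to the home directory. All names (`mesh_hops`, `OwnerPredictor`, `destinations`, `hops_to_data`) are hypothetical, and the data predictor is not modeled.

```python
# Illustrative sketch only: the topology, table policy, and replication
# rule below are assumptions, not taken from the ICP paper.

def mesh_hops(a, b, width):
    """Manhattan distance between tiles a and b on a width-wide 2D mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


class OwnerPredictor:
    """Assumed policy: remember the node that last wrote each block."""

    def __init__(self):
        self.last_writer = {}

    def predict(self, block):
        return self.last_writer.get(block)  # None if never trained

    def train(self, block, owner):
        self.last_writer[block] = owner


def destinations(block, home, predictor):
    """Nodes that receive a copy of the coherence request."""
    guess = predictor.predict(block)
    if guess is None or guess == home:
        return [home]
    # Assumed replication: the directory copy covers a wrong guess.
    return [guess, home]


def hops_to_data(requester, home, owner, dests, width):
    """Network hops until the data reaches the requester."""
    if owner in dests:
        # Request reached the owner directly: requester -> owner -> requester.
        return 2 * mesh_hops(requester, owner, width)
    # Directory indirection: requester -> home -> owner -> requester.
    return (mesh_hops(requester, home, width)
            + mesh_hops(home, owner, width)
            + mesh_hops(owner, requester, width))


WIDTH, REQUESTER, HOME, OWNER, BLOCK = 4, 0, 15, 1, 0x40
predictor = OwnerPredictor()

baseline = hops_to_data(REQUESTER, HOME, OWNER,
                        destinations(BLOCK, HOME, predictor), WIDTH)
predictor.train(BLOCK, OWNER)
predicted = hops_to_data(REQUESTER, HOME, OWNER,
                         destinations(BLOCK, HOME, predictor), WIDTH)
print(baseline, predicted)
```

    In this toy model a correct owner prediction removes the trip through the home directory, while a wrong one falls back to the directory path through the replicated copy, at the cost of one extra request message. Real latency also depends on router delay, contention, and protocol state, none of which is modeled here.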

        Published In

        ACM Transactions on Design Automation of Electronic Systems  Volume 19, Issue 3
        June 2014
        257 pages
        ISSN:1084-4309
        EISSN:1557-7309
        DOI:10.1145/2634048
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 23 June 2014
        Accepted: 01 February 2014
        Revised: 01 December 2013
        Received: 01 March 2013
        Published in TODAES Volume 19, Issue 3

        Author Tags

        1. Cache coherence
        2. multicore
        3. network-on-chip
        4. prediction

        Qualifiers

        • Research-article
        • Research
        • Refereed
