Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures

Authors:

Yongwen Wang, and

Qiang Dou

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 19, Issue 3

Article No.: 24, Pages 1 - 22

https://doi.org/10.1145/2611756

Published: 23 June 2014

Abstract

Multicore architectures with Networks-on-Chip (NoCs) are widely recognized as the de facto design for efficiently using the continuously increasing density of transistors on a chip. A key challenge in designing such an NoC-based multicore processor is maintaining cache coherence efficiently. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols and therefore scale to a large number of cores. However, conventional directory structures add significant indirection delay to cache-to-cache accesses in larger multicore processors.
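The indirection cost mentioned above can be made concrete with a small sketch. It assumes a 2D mesh NoC in which latency grows with the Manhattan hop count between tiles; the tile coordinates and the function names `hops`, `directory_path_hops`, and `direct_path_hops` are hypothetical and are not taken from the article.

```python
def hops(a, b):
    """Manhattan distance between two tiles (x, y) on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def directory_path_hops(requester, home, owner):
    """Requester -> home directory -> owner -> requester."""
    return hops(requester, home) + hops(home, owner) + hops(owner, requester)


def direct_path_hops(requester, owner):
    """Requester -> owner -> requester, with no directory lookup on the critical path."""
    return 2 * hops(requester, owner)


# Illustrative placement: the owner sits next to the requester,
# while the home directory tile is in the far corner of an 8x8 mesh.
requester, home, owner = (0, 0), (7, 7), (1, 0)
print(directory_path_hops(requester, home, owner))  # 14 + 13 + 1 = 28
print(direct_path_hops(requester, owner))           # 2
```

In this illustrative placement the directory detour dominates the transfer, which is the kind of delay a correct owner prediction would remove.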

In this article we propose a novel hardware coherence technique called integrated coherence prediction (ICP). This approach uses prediction to manage shared data and thereby reduce or eliminate the cache-to-cache delay in coherence accesses. ICP has two features that distinguish it from previous coherence prediction techniques. First, ICP introduces a new integrated prediction scheme that combines two kinds of predictors: an owner predictor, which predicts the data writers and avoids the indirection through the directory, and a data predictor, which predicts the access address and prefetches data directly from remote nodes. Second, ICP uses a request replication method to reduce the negative effect of wrong owner predictions, thus improving overall performance. We present the design and implementation details of the ICP approach. Using detailed full-system simulations, we conclude that ICP provides a cost-effective solution for designing high-performance multicore processors.
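The abstract gives no implementation details, so the following is only a minimal sketch of how an owner predictor with request replication might behave. It makes two loud assumptions: the predictor is a simple last-writer table, and "request replication" means sending one copy of the request to the predicted owner and one to the home directory. The names `OwnerPredictor`, `issue_read`, and `home_of` are hypothetical and do not come from the article.

```python
class OwnerPredictor:
    """Hypothetical last-writer table: block address -> node that last owned it."""

    def __init__(self):
        self.table = {}

    def predict(self, addr):
        return self.table.get(addr)  # None when there is no history

    def train(self, addr, owner):
        self.table[addr] = owner


def issue_read(addr, predictor, home_of, true_owner):
    """Return (destinations, served_directly) for one read miss.

    Assumption: request replication is modelled as sending one copy to the
    predicted owner and one copy to the home directory, so a wrong guess
    still resolves through the directory without a retry.
    """
    guess = predictor.predict(addr)
    destinations = [home_of(addr)]
    if guess is not None:
        destinations.append(guess)
    served_directly = guess is not None and guess == true_owner
    predictor.train(addr, true_owner)  # learn the actual owner for next time
    return destinations, served_directly


p = OwnerPredictor()
home_of = lambda addr: addr % 4  # hypothetical address-interleaved home mapping
print(issue_read(0x40, p, home_of, true_owner=3))  # ([0], False)
print(issue_read(0x40, p, home_of, true_owner=3))  # ([0, 3], True)
```

The first miss has no history and goes only to the home node; the second is replicated to the predicted owner and is served without waiting for the directory.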

Cited By

Zhou H, Deng R, Feng Q, Ni X, Dou Q. 2018. Research of Configurable Hybrid Memory Architecture for Big Data Processing. Computer Engineering and Technology, 116-132. Online publication date: 3-Jan-2018.
https://doi.org/10.1007/978-981-10-7844-6_12
Chen R, Wang Y, Hu J, Liu D, Shao Z, Guan Y. 2017. vFlash: Virtualized Flash for Optimizing the I/O Performance in Mobile Devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36:7, 1203-1214. Online publication date: 16-Jun-2017.
https://dl.acm.org/doi/10.1109/TCAD.2016.2618881

Index Terms

Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures
1. Hardware
  1. Communication hardware, interfaces and storage
2. Networks

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 19, Issue 3

June 2014

257 pages

ISSN: 1084-4309

EISSN: 1557-7309

DOI: 10.1145/2634048

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 23 June 2014

Accepted: 01 February 2014

Revised: 01 December 2013

Received: 01 March 2013

Published in TODAES Volume 19, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 2
Total Downloads: 314
Downloads (Last 12 months): 29
Downloads (Last 6 weeks): 1
