Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Authors: Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John B. Carter

ACM SIGARCH Computer Architecture News, Volume 34, Issue 2

Pages 339 - 351

https://doi.org/10.1145/1150019.1136515

Published: 01 May 2006

Abstract

Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip multiprocessors (CMPs) are an attractive choice for future billion-transistor architectures because of their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores, and data coherence is maintained among the private L1 caches. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1 caches will have a significant impact on overall processor performance and power consumption. On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect composed of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect, and we evaluate a subset of these techniques to show their performance and power-efficiency potential. Most of the proposed techniques can be implemented with minimal complexity overhead.
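
The mapping idea in the abstract can be illustrated with a minimal Python sketch of a policy that steers each coherence message to a wire class based on its latency and bandwidth needs. The wire class names, the `Message` fields, the 24-bit threshold, and the sample messages are all illustrative assumptions; the abstract does not specify them, and the techniques evaluated in the paper may differ.

```python
from dataclasses import dataclass
from enum import Enum


class WireClass(Enum):
    """Hypothetical wire classes in a heterogeneous interconnect."""
    LOW_LATENCY = "low_latency"  # fast but narrow
    BASELINE = "baseline"        # balanced latency and bandwidth
    LOW_POWER = "low_power"      # slower but more energy-efficient


@dataclass(frozen=True)
class Message:
    """A coherence message described by the two needs the abstract names."""
    kind: str
    size_bits: int           # bandwidth need
    on_critical_path: bool   # latency need


# Assumed width limit for the narrow, fast wires (not from the source).
NARROW_WIRE_LIMIT_BITS = 24


def choose_wires(msg: Message) -> WireClass:
    """Pick a wire class from a message's latency and bandwidth needs."""
    if not msg.on_critical_path:
        # Latency-tolerant traffic can trade speed for energy savings.
        return WireClass.LOW_POWER
    if msg.size_bits <= NARROW_WIRE_LIMIT_BITS:
        # Small, latency-critical messages fit on the fast narrow wires.
        return WireClass.LOW_LATENCY
    # Large, latency-critical messages need the wider baseline wires.
    return WireClass.BASELINE


if __name__ == "__main__":
    samples = [
        Message("invalidate_ack", 8, True),
        Message("data_reply", 512, True),
        Message("writeback", 512, False),
    ]
    for m in samples:
        print(m.kind, "->", choose_wires(m).value)
```

Under these assumptions, a short acknowledgement on the critical path goes to the fast wires, a full cache-line reply stays on the baseline wires, and a latency-tolerant writeback goes to the low-power wires.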

Published In

ACM SIGARCH Computer Architecture News, Volume 34, Issue 2

May 2006

383 pages

ISSN: 0163-5964

DOI: 10.1145/1150019

ISCA '06: Proceedings of the 33rd annual international symposium on Computer Architecture

June 2006

383 pages

ISBN: 076952608X

Copyright © 2006 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
