
Token coherence: decoupling performance and correctness

Authors:

Milo M. K. Martin,

David A. Wood

ACM SIGARCH Computer Architecture News, Volume 31, Issue 2

Pages 182 - 193

https://doi.org/10.1145/871656.859640

Published: 01 May 2003

Abstract

Many future shared-memory multiprocessor servers will both target commercial workloads and use highly integrated "glueless" designs. Implementing low-latency cache coherence in these systems is difficult, because traditional approaches either add indirection for common cache-to-cache misses (directory protocols) or require a totally ordered interconnect (traditional snooping protocols). Unfortunately, totally ordered interconnects are difficult to implement in glueless designs. An ideal coherence protocol would avoid both indirection and interconnect ordering; however, such an approach introduces numerous protocol races that are difficult to resolve.

We propose a new coherence framework that enables such protocols by separating performance from correctness. A performance protocol can optimize for the common case (i.e., the absence of races) and rely on the underlying correctness substrate to resolve races, provide safety, and prevent starvation. We call the combination Token Coherence, since it explicitly exchanges and counts tokens to control coherence permissions.

This paper develops TokenB, a specific Token Coherence performance protocol that allows a glueless multiprocessor both to exploit a low-latency unordered interconnect (like directory protocols) and to avoid indirection (like snooping protocols). Simulations using commercial workloads show that our new protocol can significantly outperform traditional snooping and directory protocols.
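The abstract says only that Token Coherence "explicitly exchanges and counts tokens to control coherence permissions." The Python sketch below illustrates that idea under stated assumptions: each block has a fixed number of tokens T that are never created or destroyed, a node may read a block only while it holds at least one of its tokens, and it may write the block only while it holds all T. The owner token and the requirement that a reader also hold valid data are deliberately left out. The names `Block`, `transfer`, `can_read`, `can_write`, the value `T = 4`, and the use of node 0 as memory are all hypothetical.

```python
"""Sketch of Token Coherence token counting (illustrative, not the authors' code).

Assumed rules: each block has a fixed number of tokens T that are never
created or destroyed; a node may read a block only while holding at least
one token and may write it only while holding all T tokens. The owner token
and the valid-data requirement for readers are omitted.
"""

from dataclasses import dataclass, field
from typing import Dict

T = 4  # hypothetical: tokens per block, e.g. one per processor in a 4-node system
MEMORY = 0  # hypothetical: node 0 stands for the block's home memory


@dataclass
class Block:
    # tokens[node] = how many of this block's tokens `node` currently holds.
    # Initially memory holds all T tokens.
    tokens: Dict[int, int] = field(default_factory=lambda: {MEMORY: T})

    def total(self) -> int:
        return sum(self.tokens.values())

    def can_read(self, node: int) -> bool:
        # Read rule (simplified): at least one token.
        return self.tokens.get(node, 0) >= 1

    def can_write(self, node: int) -> bool:
        # Write rule: all T tokens, so no other node can be reading.
        return self.tokens.get(node, 0) == T

    def transfer(self, src: int, dst: int, n: int) -> None:
        # Tokens only move between nodes; raise if src does not hold n of them.
        if n <= 0 or self.tokens.get(src, 0) < n:
            raise ValueError(f"node {src} does not hold {n} token(s)")
        self.tokens[src] -= n
        self.tokens[dst] = self.tokens.get(dst, 0) + n
        assert self.total() == T  # conservation invariant


# Two processors obtain one token each and may both read.
b = Block()
b.transfer(MEMORY, 1, 1)
b.transfer(MEMORY, 2, 1)
print(b.can_read(1), b.can_read(2), b.can_write(1))  # True True False

# Processor 1 collects every remaining token before it may write;
# processor 2 loses its read permission when it gives up its token.
b.transfer(MEMORY, 1, 2)
b.transfer(2, 1, 1)
print(b.can_write(1), b.can_read(2))  # True False
```

Because tokens are conserved, at most one node can hold all T tokens at once, and while it does, every other node holds none. Counting tokens therefore gives safety without relying on the interconnect to order racing requests. The sketch covers safety only; the substrate's starvation-prevention mechanism is not modeled.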



Information

Published In


ACM SIGARCH Computer Architecture News Volume 31, Issue 2

ISCA 2003

May 2003

422 pages

ISSN:0163-5964

DOI:10.1145/871656


ISCA '03: Proceedings of the 30th annual international symposium on Computer architecture
June 2003
432 pages
ISBN:0769519458
DOI:10.1145/859618
Conference Chair: Allan Gottlieb (New York University & NEC Laboratories America)
Program Chair: Kai Li (Princeton University)

Copyright © 2003 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2003

Published in SIGARCH Volume 31, Issue 2


Qualifiers

Article


Bibliometrics & Citations

Total Citations: 222

Total Downloads: 1,310

Downloads (Last 12 months): 44

Downloads (Last 6 weeks): 5

Reflects downloads up to 13 Sep 2024


Cited By

Fiolhais L, Sousa L (2023). Transient-Execution Attacks: A Computer Architect Perspective. ACM Computing Surveys 56:3 (1-38). DOI: 10.1145/3603619. Online publication date: 6-Oct-2023.
https://dl.acm.org/doi/10.1145/3603619

Upadhyay B, Ros A, M. S (2023). Fine-grain data classification to filter token coherence traffic. Journal of Parallel and Distributed Computing 171 (40-53). DOI: 10.1016/j.jpdc.2022.09.004. Online publication date: Jan-2023.
https://doi.org/10.1016/j.jpdc.2022.09.004

Thillai Rani M, Rajkumar R, Sai Pradeep K, Jaishree M, TamilSelvan S (2022). Cache Coherence for Embedded Multi-core System Architectures: A Survey and Challenges. IoT Based Control Networks and Intelligent Systems (689-702). DOI: 10.1007/978-981-19-5845-8_49. Online publication date: 12-Oct-2022.
https://doi.org/10.1007/978-981-19-5845-8_49

Gade S, Deb S (2021). A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures. ACM Transactions on Design Automation of Electronic Systems 27:1 (1-31). DOI: 10.1145/3462775. Online publication date: 13-Sep-2021.
https://dl.acm.org/doi/10.1145/3462775

Krishna T, Bharadwaj S (2021). Interconnect Modeling for Homogeneous and Heterogeneous Multiprocessors. Network-on-Chip Security and Privacy (31-54). DOI: 10.1007/978-3-030-69131-8_2. Online publication date: 22-Jan-2021.
https://doi.org/10.1007/978-3-030-69131-8_2

Menezo L, Puente V, Gregorio J (2017). An adaptive cache coherence protocol. Journal of Parallel and Distributed Computing 102:C (163-174). DOI: 10.1016/j.jpdc.2016.12.020. Online publication date: 1-Apr-2017.
https://dl.acm.org/doi/10.1016/j.jpdc.2016.12.020

Al-Manasia M, Chaczko Z (2015). Evaluation of Cache Coherence Mechanisms for Multicore Processors. Computational Intelligence and Efficiency in Engineering Systems (307-320). DOI: 10.1007/978-3-319-15720-7_22. Online publication date: 11-Mar-2015.
https://doi.org/10.1007/978-3-319-15720-7_22

Ebrahimi M, Daneshtalab M, Liljeberg P, Plosila J, Flich J, Tenhunen H (2014). Path-Based Partitioning Methods for 3D Networks-on-Chip with Minimal Adaptive Routing. IEEE Transactions on Computers 63:3 (718-733). DOI: 10.1109/TC.2012.255. Online publication date: 1-Mar-2014.
https://dl.acm.org/doi/10.1109/TC.2012.255

Sem-Jacobsen F, Rodrigo S, Strano A, Skeie T, Bertozzi D, Gilabert F (2013). Enabling power efficiency through dynamic rerouting on-chip. ACM Transactions on Embedded Computing Systems 12:4 (1-23). DOI: 10.1145/2485984.2485999. Online publication date: 3-Jul-2013.
https://dl.acm.org/doi/10.1145/2485984.2485999

Sem-Jacobsen F, Rodrigo S, Skeie T, Strano A, Bertozzi D (2013). An efficient, low-cost routing framework for convex mesh partitions to support virtualization. ACM Transactions on Embedded Computing Systems 12:4 (1-24). DOI: 10.1145/2485984.2485995. Online publication date: 3-Jul-2013.
https://dl.acm.org/doi/10.1145/2485984.2485995