
Token coherence: decoupling performance and correctness

Published: 01 May 2003

Abstract

Many future shared-memory multiprocessor servers will both target commercial workloads and use highly-integrated "glueless" designs. Implementing low-latency cache coherence in these systems is difficult, because traditional approaches either add indirection for common cache-to-cache misses (directory protocols) or require a totally-ordered interconnect (traditional snooping protocols). Unfortunately, totally-ordered interconnects are difficult to implement in glueless designs. An ideal coherence protocol would avoid indirections and interconnect ordering; however, such an approach introduces numerous protocol races that are difficult to resolve.

We propose a new coherence framework to enable such protocols by separating performance from correctness. A performance protocol can optimize for the common case (i.e., absence of races) and rely on the underlying correctness substrate to resolve races, provide safety, and prevent starvation. We call the combination Token Coherence, since it explicitly exchanges and counts tokens to control coherence permissions.

This paper develops TokenB, a specific Token Coherence performance protocol that allows a glueless multiprocessor to both exploit a low-latency unordered interconnect (like directory protocols) and avoid indirection (like snooping protocols). Simulations using commercial workloads show that our new protocol can significantly outperform traditional snooping and directory protocols.
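The abstract says only that tokens are exchanged and counted to control coherence permissions; it does not spell out the counting rule. As a rough illustration, the sketch below assumes the rule usually associated with token counting: each block has a fixed number of tokens, a node needs at least one token to read and all of them to write, and tokens are never created or destroyed. The class name `TokenBlock`, the constant `TOTAL_TOKENS`, and the node labels are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

# Assumption for illustration: a fixed token count per block, e.g. one per processor.
TOTAL_TOKENS = 4


@dataclass
class TokenBlock:
    """Token bookkeeping for a single memory block (illustrative sketch only)."""

    # Tokens held by each node; memory is assumed to start with all of them.
    holdings: dict = field(default_factory=lambda: {"memory": TOTAL_TOKENS})

    def tokens(self, node):
        """Number of tokens the node currently holds (0 if never seen)."""
        return self.holdings.get(node, 0)

    def can_read(self, node):
        # Assumed rule: at least one token grants read permission.
        return self.tokens(node) >= 1

    def can_write(self, node):
        # Assumed rule: holding every token grants write permission.
        return self.tokens(node) == TOTAL_TOKENS

    def transfer(self, src, dst, count):
        """Move tokens between nodes; tokens are never created or destroyed."""
        if count <= 0 or count > self.tokens(src):
            raise ValueError("sender does not hold that many tokens")
        self.holdings[src] = self.tokens(src) - count
        self.holdings[dst] = self.tokens(dst) + count
        # Conservation check: the total stays constant after every transfer.
        assert sum(self.holdings.values()) == TOTAL_TOKENS


block = TokenBlock()
block.transfer("memory", "P0", 1)   # P0 obtains one token: it may read
block.transfer("memory", "P1", 1)   # P1 may read concurrently
block.transfer("memory", "P0", 2)   # P0 collects memory's remaining tokens
block.transfer("P1", "P0", 1)       # P1 gives up its token: P0 may now write
```

Under these assumptions a writer and a reader can never coexist, regardless of the order in which transfers arrive, which is the kind of safety property the abstract attributes to the correctness substrate rather than to the performance protocol.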


Published In

ACM SIGARCH Computer Architecture News, Volume 31, Issue 2
ISCA 2003
May 2003
422 pages
ISSN: 0163-5964
DOI: 10.1145/871656
  • ISCA '03: Proceedings of the 30th annual international symposium on Computer architecture
    June 2003
    432 pages
    ISBN: 0769519458
    DOI: 10.1145/859618
    • Conference Chair: Allan Gottlieb
    • Program Chair: Kai Li

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article


Cited By

  • (2023) Transient-Execution Attacks: A Computer Architect Perspective. ACM Computing Surveys 56:3 (1-38). DOI: 10.1145/3603619. Online publication date: 6-Oct-2023
  • (2023) Fine-grain data classification to filter token coherence traffic. Journal of Parallel and Distributed Computing 171 (40-53). DOI: 10.1016/j.jpdc.2022.09.004. Online publication date: Jan-2023
  • (2022) Cache Coherence for Embedded Multi-core System Architectures: A Survey and Challenges. IoT Based Control Networks and Intelligent Systems (689-702). DOI: 10.1007/978-981-19-5845-8_49. Online publication date: 12-Oct-2022
  • (2021) A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures. ACM Transactions on Design Automation of Electronic Systems 27:1 (1-31). DOI: 10.1145/3462775. Online publication date: 13-Sep-2021
  • (2021) Interconnect Modeling for Homogeneous and Heterogeneous Multiprocessors. Network-on-Chip Security and Privacy (31-54). DOI: 10.1007/978-3-030-69131-8_2. Online publication date: 22-Jan-2021
  • (2017) An adaptive cache coherence protocol. Journal of Parallel and Distributed Computing 102:C (163-174). DOI: 10.1016/j.jpdc.2016.12.020. Online publication date: 1-Apr-2017
  • (2015) Evaluation of Cache Coherence Mechanisms for Multicore Processors. Computational Intelligence and Efficiency in Engineering Systems (307-320). DOI: 10.1007/978-3-319-15720-7_22. Online publication date: 11-Mar-2015
  • (2014) Path-Based Partitioning Methods for 3D Networks-on-Chip with Minimal Adaptive Routing. IEEE Transactions on Computers 63:3 (718-733). DOI: 10.1109/TC.2012.255. Online publication date: 1-Mar-2014
  • (2013) Enabling power efficiency through dynamic rerouting on-chip. ACM Transactions on Embedded Computing Systems 12:4 (1-23). DOI: 10.1145/2485984.2485999. Online publication date: 3-Jul-2013
  • (2013) An efficient, low-cost routing framework for convex mesh partitions to support virtualization. ACM Transactions on Embedded Computing Systems 12:4 (1-24). DOI: 10.1145/2485984.2485995. Online publication date: 3-Jul-2013
