An evaluation of memory consistency models for shared-memory systems with ILP processors

Published: 01 September 1996

    Abstract

    Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. Researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative loads, have the potential to equalize the hardware performance of memory consistency models on such processors.

    This paper performs the first detailed quantitative comparison of several implementations of sequential consistency and release consistency optimized for aggressive ILP processors. Our results indicate that hardware prefetching and speculative loads dramatically improve the performance of sequential consistency. However, the gap between sequential consistency and release consistency depends on the cache write policy and the complexity of the cache-coherence protocol implementation. In most cases, release consistency significantly outperforms sequential consistency, but for two applications, the use of a write-back primary cache and a more complex cache-coherence protocol nearly equalizes the performance of the two models.

    We also observe that the existing techniques, which require on-chip hardware modifications, enhance the performance of release consistency only to a small extent. We propose two new software techniques, fuzzy acquires and selective acquires, to achieve more overlap than allowed by the previous implementations of release consistency. To enhance methods for overlapping acquires, we also propose a technique to eliminate control dependences caused by an acquire loop, using a small amount of off-chip hardware called the synchronization buffer.
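
    The overlap that the abstract attributes to release consistency can be illustrated with a toy cost model. The sketch below is not the paper's simulator or methodology; it assumes a hypothetical `Op` record with made-up latencies, a strict sequential-consistency baseline with no prefetching or speculation, and an idealised release consistency in which data accesses between an acquire and a release overlap completely. It ignores caches, bandwidth, and the coherence protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One memory operation in a hypothetical trace (illustrative only)."""
    kind: str      # "acquire", "release", or "data"
    latency: int   # assumed cycles to complete; invented numbers


def sc_cycles(ops):
    """Strict SC baseline: each operation waits for the previous one."""
    return sum(op.latency for op in ops)


def rc_cycles(ops):
    """Idealised RC: only acquires and releases impose ordering."""
    ready = 0      # earliest issue time for later ops (last acquire done)
    sync_done = 0  # completion time of the last acquire or release
    finish = 0     # latest completion time of any operation so far
    for op in ops:
        if op.kind == "acquire":
            # An acquire is ordered after earlier synchronisation ops,
            # and every later operation must wait for it to complete.
            done = sync_done + op.latency
            ready = sync_done = done
            finish = max(finish, done)
        elif op.kind == "release":
            # A release waits for all earlier operations to complete.
            done = finish + op.latency
            sync_done = finish = done
        else:
            # Data accesses issue as soon as the last acquire is done,
            # so independent accesses overlap with one another.
            finish = max(finish, ready + op.latency)
    return finish


# A critical section with three independent remote accesses.
critical_section = [
    Op("acquire", 100),
    Op("data", 100),
    Op("data", 100),
    Op("data", 100),
    Op("release", 100),
]

print(sc_cycles(critical_section))  # 500
print(rc_cycles(critical_section))  # 300
```

    In this model the remaining serial cost under release consistency is the acquire itself, since nothing after it can issue until it completes. That is the portion of latency the abstract's proposed acquire-overlapping techniques aim to reduce; the model does not attempt to represent those techniques.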

    Published In

    ACM SIGPLAN Notices, Volume 31, Issue 9
    Sept. 1996
    273 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/248209

    ASPLOS VII: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
    October 1996
    290 pages
    ISBN: 0897917677
    DOI: 10.1145/237090
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Cited By

    • (2008) The revolution inside the box. Communications of the ACM 51:7 (70-78). DOI: 10.1145/1364782.1364799. Online publication date: 1-Jul-2008
    • (2005) Cost estimation of coherence protocols of software managed cache on distributed shared memory system. High Performance Computing (335-342). DOI: 10.1007/BFb0024228. Online publication date: 9-Jun-2005
    • (2020) Speculative Enforcement of Store Atomicity. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (555-567). DOI: 10.1109/MICRO50266.2020.00053. Online publication date: Oct-2020
    • (2017) Hiding the Long Latency of Persist Barriers Using Speculative Execution. ACM SIGARCH Computer Architecture News 45:2 (175-186). DOI: 10.1145/3140659.3080240. Online publication date: 24-Jun-2017
    • (2017) Using Thermal Stimuli to Enhance Photo-Sharing in Social Media. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1:2 (1-21). DOI: 10.1145/3090050. Online publication date: 30-Jun-2017
    • (2017) Hiding the Long Latency of Persist Barriers Using Speculative Execution. Proceedings of the 44th Annual International Symposium on Computer Architecture (175-186). DOI: 10.1145/3079856.3080240. Online publication date: 24-Jun-2017
    • (2016) Ranking-Oriented Collaborative Filtering. ACM Transactions on Information Systems 35:2 (1-28). DOI: 10.1145/2960408. Online publication date: 21-Sep-2016
    • (2015) Efficiently enforcing strong memory ordering in GPUs. Proceedings of the 48th International Symposium on Microarchitecture (699-712). DOI: 10.1145/2830772.2830778. Online publication date: 5-Dec-2015
    • (2012) Balancing Programmability and Silicon Efficiency of Heterogeneous Multicore Architectures. ACM Transactions on Embedded Computing Systems 11S:1 (1-32). DOI: 10.1145/2180887.2180890. Online publication date: 1-Jun-2012
    • (2009) Incorporating constraints in probabilistic XML. ACM Transactions on Database Systems 34:3 (1-45). DOI: 10.1145/1567274.1567280. Online publication date: 3-Sep-2009
