Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2694344.2694356acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

Published: 14 March 2015 Publication History
  • Get Citation Alerts
  • Abstract

    Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-the-art MESI protocol. DeNovo, however, severely restricted the synchronization constructs an application can support. This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo. The key challenge is that DeNovo exploits race-freedom to use reader-initiated local self-invalidations (instead of conventional writer-initiated remote cache invalidations) to ensure coherence. Synchronization accesses are inherently racy and not directly amenable to self-invalidations. DeNovoSync addresses this challenge using a novel combination of registration of all synchronization reads with a judicious hardware backoff to limit unnecessary registrations. For a wide variety of synchronization constructs and applications, compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic, enabling DeNovo's advantages for a much broader class of software than previously possible.

    References

    [1]
    S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29 (12): 66--76, 1996.
    [2]
    A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In Proceedings of the 16th Annual International Symposium on Computer Architecture, ISCA '89, 1989.
    [3]
    N. Agarwal, T. Krishna, L.-S. Peh, and N. Jha. Garnet: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Princeton University, 2008. URL http://www.princeton.edu/~niketa/garnet.
    [4]
    T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1 (1), Jan. 1990.
    [5]
    B. Bershad, M. Zekauskas, and W. Sawdon. The midway distributed shared memory system. In Compcon Spring '93, Digest of Papers., Feb 1993.
    [6]
    C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, Jan. 2011.
    [7]
    R. L. Bocchino, Jr., V. S. Adve, D. Dig, S. V. Adve, S. Heumann, R. Komuravelli, J. Overbey, P. Simmons, H. Sung, and M. Vakilian. A type and effect system for deterministic parallel java. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '09, 2009.
    [8]
    R. L. Bocchino, Jr., S. Heumann, N. Honarmand, S. V. Adve, V. S. Adve, A. Welc, and T. Shpeisman. Safe nondeterminism in a deterministic-by-default parallel language. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '11, 2011.
    [9]
    H.-J. Boehm and S. V. Adve. Foundations of the c
    [10]
    concurrency memory model. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, 2008.
    [11]
    B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou. Denovo: Rethinking the memory hierarchy for disciplined parallelism. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, PACT '11, 2011.
    [12]
    M. Elver and V. Nagarajan. Tso-cc: Consistency directed cache coherence for tso. In IEEE 20th International Symposium on High Performance Computer Architecture, HPCA-20, Feb 2014.
    [13]
    J. R. Goodman and P. J. Woest. The wisconsin multicube: A new large-scale cache-coherent multiprocessor. In Proceedings of the 15th Annual International Symposium on Computer Architecture, ISCA '88, 1988.
    [14]
    J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS III, 1989.
    [15]
    M. Herlihy. A methodology for implementing highly concurrent data structures. In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, PPOPP '90, 1990.
    [16]
    M. D. Hill, J. R. Larus, S. K. Reinhardt, and D. A. Wood. Cooperative shared memory: Software and hardware for scalable multiprocessor. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS V, 1992.
    [17]
    L. Iftode, J. P. Singh, and K. Li. Scope consistency: A bridge between release consistency and entry consistency. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '96, 1996.
    [18]
    S. Kaxiras and G. Keramidas. SARC Coherence: Scaling Directory Cache Coherence in Performance and Power. IEEE Micro, 30 (5), Sept.-Oct. 2010.
    [19]
    P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA '92, 1992.
    [20]
    J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel. Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, 2009.
    [21]
    J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel. Cohesion: A Hybrid Memory Model for Accelerators. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, 2010.
    [22]
    R. Komuravelli, S. V. Adve, and C.-T. Chou. Revisiting the complexity of hardware cache coherence and some implications. ACM Trans. Archit. Code Optim., Dec. 2014.
    [23]
    D. Koufaty, X. Chen, D. Poulsen, and J. Torrellas. Data forwarding in scalable shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 7 (12), dec 1996.
    [24]
    J. R. Larus, S. Chandra, and D. A. Wood. Cico: A practical shared-memory programming performance model. In Workshop on Portability and Performance for Parallel Processing, 1993.
    [25]
    A. R. Lebeck and D. A. Wood. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, 1995.
    [26]
    P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35: 50--58, 2002.
    [27]
    J. Manson, W. Pugh, and S. V. Adve. The java memory model. In Proceedings of the 32Nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05, 2005.
    [28]
    M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. SIGARCH Computer Architecture News, 33 (4): 92--99, 2005.
    [29]
    M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC '96, 1996.
    [30]
    M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. J. Parallel Distrib. Comput., 51 (1), May 1998.
    [31]
    S. L. Min and J.-L. Baer. Design and analysis of a scalable cache coherence scheme based on clocks and timestamps. IEEE Trans. on Parallel and Distributed Systems, 3 (2): 25--44, January 1992.
    [32]
    R. Rajwar, A. Kagi, and J. Goodman. Improving the throughput of synchronization by insertion of delays. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, HPCA-6, 2000.
    [33]
    A. Ros and S. Kaxiras. Complexity-effective multicore coherence. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, PACT '12, 2012.
    [34]
    M. Scott. Shared Memory Synchronization. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2013. ISBN 9781608459568. URL http://books.google.com/books?id=N4YcnQEACAAJ.
    [35]
    S. Subramaniam, S. C. Steely, W. Hasenplaugh, A. Jaleel, C. Beckmann, T. Fossum, and J. Emer. Using in-flight chains to build a scalable cache coherence protocol. ACM Trans. Archit. Code Optim., 10 (4), Dec. 2013.
    [36]
    H. Sung, R. Komuravelli, and S. V. Adve. DeNovoND: efficient hardware support for disciplined non-determinism. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems, ASPLOS '13, 2013.
    [37]
    H. Sung, R. Komuravelli, and S. V. Adve. DeNovoND: efficient hardware for disciplined nondeterminism. IEEE Micro, 34 (3), 2014.
    [38]
    S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The splash-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, 1995.
    [39]
    J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos. A tagless coherence directory. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, 2009.

    Cited By

    View all
    • (2019)Rethinking Support for Region Conflict Exceptions2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2019.00116(1095-1106)Online publication date: May-2019
    • (2018)Automatic Detection of Large Extended Data-Race-Free Regions with Conflict IsolationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2017.277150929:3(527-541)Online publication date: 1-Mar-2018
    • (2018)SpandexProceedings of the 45th Annual International Symposium on Computer Architecture10.1109/ISCA.2018.00031(261-274)Online publication date: 2-Jun-2018
    • Show More Cited By

    Index Terms

    1. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image ACM Conferences
        ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2015
        720 pages
        ISBN:9781450328357
        DOI:10.1145/2694344
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Sponsors

        In-Cooperation

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 14 March 2015

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. cache coherence
        2. consistency
        3. shared memory
        4. synchronization

        Qualifiers

        • Research-article

        Funding Sources

        Conference

        ASPLOS '15

        Acceptance Rates

        ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;
        Overall Acceptance Rate 535 of 2,713 submissions, 20%

        Upcoming Conference

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)31
        • Downloads (Last 6 weeks)3

        Other Metrics

        Citations

        Cited By

        View all
        • (2019)Rethinking Support for Region Conflict Exceptions2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2019.00116(1095-1106)Online publication date: May-2019
        • (2018)Automatic Detection of Large Extended Data-Race-Free Regions with Conflict IsolationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2017.277150929:3(527-541)Online publication date: 1-Mar-2018
        • (2018)SpandexProceedings of the 45th Annual International Symposium on Computer Architecture10.1109/ISCA.2018.00031(261-274)Online publication date: 2-Jun-2018
        • (2018)Constructing a weak memory modelProceedings of the 45th Annual International Symposium on Computer Architecture10.1109/ISCA.2018.00021(124-137)Online publication date: 2-Jun-2018
        • (2017)Automatic detection of extended data-race-free regionsProceedings of the 2017 International Symposium on Code Generation and Optimization10.5555/3049832.3049835(14-26)Online publication date: 4-Feb-2017
        • (2017)Chasing Away RAtsACM SIGARCH Computer Architecture News10.1145/3140659.308020645:2(161-174)Online publication date: 24-Jun-2017
        • (2017)Chasing Away RAtsProceedings of the 44th Annual International Symposium on Computer Architecture10.1145/3079856.3080206(161-174)Online publication date: 24-Jun-2017
        • (2017)Efficient Self-Invalidation/Self-Downgrade for Critical Sections with Relaxed SemanticsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2017.272074428:12(3413-3425)Online publication date: 1-Dec-2017
        • (2017)TC-Release++: An Efficient Timestamp-Based Coherence Protocol for Many-Core ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2017.271967928:11(3313-3327)Online publication date: 1-Nov-2017
        • (2016)RacerThe 49th Annual IEEE/ACM International Symposium on Microarchitecture10.5555/3195638.3195678(1-13)Online publication date: 15-Oct-2016
        • Show More Cited By

        View Options

        Get Access

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media