DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

Authors:

Sarita V. Adve

ACM SIGPLAN Notices, Volume 50, Issue 4

Pages 545 - 559

https://doi.org/10.1145/2775054.2694356

Published: 14 March 2015

Abstract

Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-the-art MESI protocol. DeNovo, however, severely restricted the synchronization constructs an application can support. This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo. The key challenge is that DeNovo exploits race-freedom to use reader-initiated local self-invalidations (instead of conventional writer-initiated remote cache invalidations) to ensure coherence. Synchronization accesses are inherently racy and not directly amenable to self-invalidations. DeNovoSync addresses this challenge using a novel combination of registration of all synchronization reads with a judicious hardware backoff to limit unnecessary registrations. For a wide variety of synchronization constructs and applications, compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic, enabling DeNovo's advantages for a much broader class of software than previously possible.
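The effect of backoff on registration traffic can be illustrated with a toy model. The sketch below is not the DeNovoSync protocol, whose details the abstract does not give. It assumes, for illustration only, that each synchronization variable has a single registered core, that a synchronization read by any other core moves the registration to the reader, and that backoff is a fixed delay between a core's successive reads. The names `SyncWord`, `sync_read`, and `spin` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncWord:
    """Toy model of one synchronization variable (hypothetical, not from the paper)."""
    value: int = 0
    owner: Optional[int] = None  # core currently registered for this word
    transfers: int = 0           # number of times registration changed hands

    def sync_read(self, core: int) -> int:
        # Assumption: a synchronization read needs the registration, so a
        # read by a non-owner moves the registration to the reading core.
        if self.owner != core:
            self.owner = core
            self.transfers += 1
        return self.value


def spin(readers: int, rounds: int, backoff: int) -> int:
    """Count registration transfers while `readers` cores poll one word.

    After each read, a core waits `backoff` extra rounds before reading again.
    The fixed delay is a simplification chosen for this sketch.
    """
    word = SyncWord()
    next_read = [0] * readers  # earliest round at which each core may read
    for t in range(rounds):
        for core in range(readers):
            if t >= next_read[core]:
                word.sync_read(core)
                next_read[core] = t + 1 + backoff
    return word.transfers


if __name__ == "__main__":
    print("no backoff:  ", spin(readers=4, rounds=8, backoff=0))
    print("with backoff:", spin(readers=4, rounds=8, backoff=3))
```

In this model, four cores spinning on one word with no backoff move the registration on every read, while a delay between reads cuts the number of transfers in proportion. A single spinning core registers once and then reads locally.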

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information

Published In

ACM SIGPLAN Notices Volume 50, Issue 4

ASPLOS '15

April 2015

676 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2775054

Editor:
Andy Gill
University of Kansas, Lawrence, KS

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN:9781450328357
DOI:10.1145/2694344
General Chairs: Ozcan Ozturk (Bilkent University, Turkey), Kemal Ebcioglu (Global Supercomputing, USA)
Program Chair: Sandhya Dwarkadas (University of Rochester, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2015

Published in SIGPLAN Volume 50, Issue 4

Qualifiers

Research-article

Funding Sources

National Science Foundation
Illinois/Intel Parallelism Center at Illinois (I2PC)
the Center for Future Architectures Research (C-FAR)

Bibliometrics

Total Citations: 26
Total Downloads: 473
Downloads (Last 12 months): 30
Downloads (Last 6 weeks): 5

Reflects downloads up to 12 Sep 2024

Cited By

Abdulla P, Atig M, Kaxiras S, Leonardsson C, Ros A, Zhu Y (2016). Fencing Programs with Self-Invalidation and Self-Downgrade. 36th IFIP WG 6.1 International Conference on Formal Techniques for Distributed Objects, Components, and Systems - Volume 9688, pp. 19-35. DOI: 10.1007/978-3-319-39570-8_2. Online publication date: 6-Jun-2016
https://dl.acm.org/doi/10.1007/978-3-319-39570-8_2
Biswas S, Zhang R, Bond M, Lucia B (2019). Rethinking Support for Region Conflict Exceptions. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1095-1106. DOI: 10.1109/IPDPS.2019.00116. Online publication date: May-2019
https://doi.org/10.1109/IPDPS.2019.00116
Jimborean A, Ekemark P, Waern J, Kaxiras S, Ros A (2018). Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation. IEEE Transactions on Parallel and Distributed Systems, 29:3, pp. 527-541. DOI: 10.1109/TPDS.2017.2771509. Online publication date: 1-Mar-2018
https://doi.org/10.1109/TPDS.2017.2771509
Alsop J, Sinclair M, Adve S (2018). Spandex. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 261-274. DOI: 10.1109/ISCA.2018.00031. Online publication date: 2-Jun-2018
https://dl.acm.org/doi/10.1109/ISCA.2018.00031
Zhang S, Vijayaraghavan M, Wright A, Alipour M, Arvind (2018). Constructing a weak memory model. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 124-137. DOI: 10.1109/ISCA.2018.00021. Online publication date: 2-Jun-2018
https://dl.acm.org/doi/10.1109/ISCA.2018.00021
Jimborean A, Waern J, Ekemark P, Kaxiras S, Ros A, Reddi V, Smith A, Tang L (2017). Automatic detection of extended data-race-free regions. Proceedings of the 2017 International Symposium on Code Generation and Optimization, pp. 14-26. DOI: 10.5555/3049832.3049835. Online publication date: 4-Feb-2017
https://dl.acm.org/doi/10.5555/3049832.3049835
Sinclair M, Alsop J, Adve S (2017). Chasing Away RAts. ACM SIGARCH Computer Architecture News, 45:2, pp. 161-174. DOI: 10.1145/3140659.3080206. Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3140659.3080206
Sinclair M, Alsop J, Adve S (2017). Chasing Away RAts. Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 161-174. DOI: 10.1145/3079856.3080206. Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3079856.3080206
Ros A, Leonardsson C, Sakalis C, Kaxiras S (2017). Efficient Self-Invalidation/Self-Downgrade for Critical Sections with Relaxed Semantics. IEEE Transactions on Parallel and Distributed Systems, 28:12, pp. 3413-3425. DOI: 10.1109/TPDS.2017.2720744. Online publication date: 1-Dec-2017
https://doi.org/10.1109/TPDS.2017.2720744
Yao Y, Chen W, Mitra T, Xiang Y (2017). TC-Release++: An Efficient Timestamp-Based Coherence Protocol for Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems, 28:11, pp. 3313-3327. DOI: 10.1109/TPDS.2017.2719679. Online publication date: 1-Nov-2017
https://doi.org/10.1109/TPDS.2017.2719679