research-article

DeNovoND: efficient hardware support for disciplined non-determinism

Authors:

Rakesh Komuravelli,

Sarita V. Adve

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

Pages 13 - 26

https://doi.org/10.1145/2451116.2451119

Published: 16 March 2013

Abstract

Recent work has shown that disciplined shared-memory programming models that provide deterministic-by-default semantics can simplify both parallel software and hardware. Specifically, the DeNovo hardware system has shown that the software guarantees of such models (e.g., data-race-freedom and explicit side-effects) can enable simpler, higher performance, and more energy-efficient hardware than the current state-of-the-art for deterministic programs. Many applications, however, contain non-deterministic parts; e.g., using lock synchronization. For commercial hardware to exploit the benefits of DeNovo, it is therefore necessary to extend DeNovo to support non-deterministic applications.

This paper proposes DeNovoND, a system that supports lock-based, disciplined non-determinism, with the simplicity, performance, and energy benefits of DeNovo. We use a combination of distributed queue-based locks and access signatures to implement simple memory consistency semantics for safe non-determinism, with a coherence protocol that does not require transient states, invalidation traffic, or directories, and does not incur false sharing. The resulting system is simpler, shows comparable or better execution time, and has 33% less network traffic on average (translating directly into energy savings) relative to a state-of-the-art invalidation-based protocol for 8 applications designed for lock synchronization.
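
To make the notion of an access signature more concrete, here is a minimal software sketch. It is not the paper's hardware design. It assumes a Bloom-filter-style signature that conservatively records the addresses written inside a critical section, and it assumes that a later lock acquirer could query that signature to decide whether a cached word might be stale. The class name `AccessSignature`, its parameters, and the use of SHA-256 for hashing are illustrative choices only.

```python
import hashlib


class AccessSignature:
    """Bloom-filter-style set of addresses (hypothetical sketch)."""

    def __init__(self, num_bits=256, num_hashes=2):
        # num_bits and num_hashes are assumed values, not taken from the paper.
        # num_hashes must be between 1 and 8 (one 4-byte chunk per hash).
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, address):
        # Derive num_hashes bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(address.to_bytes(8, "little")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def insert(self, address):
        # Record a written address by setting its bits.
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def may_contain(self, address):
        # True means "possibly written"; False means "definitely not written".
        return all((self.bits >> pos) & 1 for pos in self._positions(address))

    def union(self, other):
        # Merge two signatures, e.g. to accumulate writes across lock holders.
        merged = AccessSignature(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


# Usage: a writer records two addresses; a reader queries the signature.
writer_sig = AccessSignature()
writer_sig.insert(0x1000)
writer_sig.insert(0x2040)
hit = writer_sig.may_contain(0x1000)
```

A signature of this kind can report false positives but never false negatives, so a hit only means that an address may have been written, while a miss means it certainly was not.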

Index Terms

DeNovoND: efficient hardware support for disciplined non-determinism
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

March 2013

574 pages

ISBN:9781450318709

DOI:10.1145/2451116

General Chair: Vivek Sarkar, Rice University, USA

Program Chair: Rastislav Bodik, University of California, Berkeley, USA

ACM SIGARCH Computer Architecture News, Volume 41, Issue 1
ASPLOS '13
March 2013
540 pages
ISSN:0163-5964
DOI:10.1145/2490301

ACM SIGPLAN Notices, Volume 48, Issue 4
ASPLOS '13
April 2013
540 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2499368

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '13

ASPLOS '13: Architectural Support for Programming Languages and Operating Systems

March 16 - 20, 2013

Houston, Texas, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
