Lease/Release: Architectural Support for Scaling Contended Data Structures

Authors:

Syed Kamran Haider,

William Hasenplaugh,

Dan Alistarh

ACM Transactions on Parallel Computing (TOPC), Volume 4, Issue 2

Article No.: 8, Pages 1 - 25

https://doi.org/10.1145/3132168

Published: 10 October 2017

Abstract

High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. There has been a significant amount of research effort spent investigating designs that minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms to allow scaling contended data structures at high thread counts.

In this article, we investigate hardware support for scalable contended data structures. We propose Lease/Release, a simple addition to standard directory-based MESI cache coherence protocols, allowing participants to lease memory, at the granularity of cache lines, by delaying coherence messages for a short, bounded period of time. Our analysis shows that Lease/Release can significantly reduce the overheads of contention for both non-blocking (lock-free) and lock-based data structure implementations while ensuring that no deadlocks are introduced. We validate Lease/Release empirically on the Graphite multiprocessor simulator on a range of data structures, including queue, stack, and priority queue implementations, as well as on transactional applications. Results show that Lease/Release consistently improves both throughput and energy usage, by up to 5x, both for lock-free and lock-based data structure designs.
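
The leasing idea described in the abstract can be illustrated with a small toy model. The sketch below is not the article's hardware protocol: the class `LeasingCore`, its method names, and the bound `MAX_LEASE_CYCLES` are all hypothetical and chosen only for illustration. It shows the two properties the abstract states: coherence requests for a leased cache line are delayed rather than serviced, and the delay is bounded because every lease either expires or is released voluntarily.

```python
from dataclasses import dataclass, field

# Hypothetical upper bound on any lease, in cycles (an assumption, not a value from the article).
MAX_LEASE_CYCLES = 1000


@dataclass
class LeasingCore:
    """Toy model of one core's lease table; all names are illustrative."""
    leases: dict = field(default_factory=dict)   # cache line -> expiry cycle
    pending: dict = field(default_factory=dict)  # cache line -> delayed requesters

    def lease(self, line, now, cycles):
        # Clamp the requested duration so the delay seen by others stays bounded.
        self.leases[line] = now + min(cycles, MAX_LEASE_CYCLES)

    def release(self, line):
        # Voluntary release: drop the lease and hand back the requests it delayed.
        self.leases.pop(line, None)
        return self.pending.pop(line, [])

    def on_coherence_request(self, line, requester, now):
        # Return True if the request can be serviced now, False if it is delayed.
        expiry = self.leases.get(line)
        if expiry is not None and now < expiry:
            self.pending.setdefault(line, []).append(requester)
            return False
        return True

    def tick(self, now):
        # Expire leases whose time is up and return the requests they were delaying.
        ready = []
        for line, expiry in list(self.leases.items()):
            if now >= expiry:
                del self.leases[line]
                ready.extend((line, r) for r in self.pending.pop(line, []))
        return ready


core = LeasingCore()
core.lease(line=0x40, now=0, cycles=5000)          # clamped to MAX_LEASE_CYCLES
delayed = not core.on_coherence_request(0x40, "core1", now=10)
serviced_at_expiry = core.tick(now=MAX_LEASE_CYCLES)
```

In this model a requester is never blocked for longer than `MAX_LEASE_CYCLES`, which is the kind of bound that rules out a lease holder stalling others indefinitely.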

Information

Published In

ACM Transactions on Parallel Computing Volume 4, Issue 2

Special Issue: Invited papers from PPoPP 2016, Part 2

June 2017

154 pages

ISSN: 2329-4949

EISSN: 2329-4957

DOI: 10.1145/3134419

Editor:
Phillip B. Gibbons
Carnegie Mellon University, Pittsburgh, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2017

Accepted: 01 August 2017

Revised: 01 July 2017

Received: 01 January 2017

Published in TOPC Volume 4, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Swiss National Fund Ambizione Fellowship
