Lock Cohorting: A General Technique for Designing NUMA Locks

Authors:

Virendra J. Marathe,

Nir Shavit

ACM Transactions on Parallel Computing (TOPC), Volume 1, Issue 2

Article No.: 13, Pages 1 - 42

https://doi.org/10.1145/2686884

Published: 18 February 2015

Abstract

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.

Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
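The abstract does not spell out the construction, so the following is only a minimal sketch of the general shape such a transformation can take, here composed from two layers of ticket locks: one global lock plus one local lock per NUMA node, where a releasing thread hands ownership to a waiter on its own node when one exists. All names (`TicketLock`, `CohortLock`, `kMaxHandoffs`, `run_demo`) are hypothetical, the handoff bound of 64 is an arbitrary assumption, and the node id is passed in by the caller rather than queried from the operating system. The article's own algorithms may differ in detail.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Plain ticket lock. unlock() may be called by any thread, which lets it act
// as the global lock; has_waiters() lets the local lock detect a cohort.
class TicketLock {
    std::atomic<std::uint32_t> next_{0};
    std::atomic<std::uint32_t> serving_{0};
public:
    void lock() {
        std::uint32_t my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my)
            std::this_thread::yield();
    }
    void unlock() { serving_.fetch_add(1, std::memory_order_release); }
    // Call only while holding the lock: more than one outstanding ticket
    // means at least one other thread is waiting.
    bool has_waiters() const {
        return next_.load(std::memory_order_relaxed) -
               serving_.load(std::memory_order_relaxed) > 1;
    }
};

// Hypothetical cohort lock: one global lock, one local lock per node.
class CohortLock {
    struct Node {
        TicketLock local;
        bool global_owned = false;  // protected by `local`
        int handoffs = 0;           // protected by `local`
    };
    static constexpr int kMaxHandoffs = 64;  // assumed fairness bound
    TicketLock global_;
    std::vector<Node> nodes_;
public:
    explicit CohortLock(std::size_t node_count) : nodes_(node_count) {}

    void lock(std::size_t node) {
        Node& n = nodes_[node];
        n.local.lock();
        if (!n.global_owned) {      // predecessor did not pass the global lock
            global_.lock();
            n.global_owned = true;
        }
    }

    void unlock(std::size_t node) {
        Node& n = nodes_[node];
        if (n.local.has_waiters() && n.handoffs < kMaxHandoffs) {
            ++n.handoffs;           // keep the global lock on this node
            n.local.unlock();
        } else {
            n.handoffs = 0;         // give other nodes a turn
            n.global_owned = false;
            global_.unlock();
            n.local.unlock();
        }
    }
};

// Small demo: threads pretend to live on two nodes and bump a shared counter.
long run_demo(int threads, int iters) {
    CohortLock lock(2);
    long counter = 0;               // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&lock, &counter, iters, t] {
            std::size_t node = static_cast<std::size_t>(t % 2);
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

In this sketch the handoff bound is what keeps one node from monopolizing the global lock; a production version would also pad each per-node record to its own cache line and derive the node id from the running CPU.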

We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic microbenchmark, on memcached (a real-world key-value store application), and on the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.

Index Terms

Lock Cohorting: A General Technique for Designing NUMA Locks
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published In

ACM Transactions on Parallel Computing Volume 1, Issue 2

Special Issue on PPOPP 2012

January 2015

224 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/2737841

Editor:
Phillip B. Gibbons
Intel Labs, Pittsburgh, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 February 2015

Accepted: 01 September 2014

Revised: 01 July 2014

Received: 01 April 2013

Published in TOPC Volume 1, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Intel
DoE ASCR
NSF
Oracle
