Lock Cohorting: A General Technique for Designing NUMA Locks

Published: 18 February 2015

Abstract

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.
Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.
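
The abstract does not spell out the construction, so the following is only a minimal sketch of the general cohorting idea: one global lock plus one local lock per NUMA node, where a releasing thread prefers to hand ownership to a waiting thread on its own node. All names here (`CohortLock`, `_Node`, `max_handoffs`, `demo`) are hypothetical, the caller is assumed to pass its own node id, and Python's blocking `threading.Lock` stands in for the spin locks (TATAS-Backoff, CLH, MCS, ticket) discussed in the article. It illustrates the structure only and says nothing about performance.

```python
import threading


class _Node:
    """Per-node state. Fields other than `waiters` are only touched while holding `lock`."""

    def __init__(self):
        self.lock = threading.Lock()   # local lock shared by threads of this node
        self.guard = threading.Lock()  # protects `waiters` only
        self.waiters = 0               # threads currently queued on `lock`
        self.owns_global = False       # True while this node's cohort holds the global lock
        self.handoffs = 0              # consecutive local hand-offs so far


class CohortLock:
    """Illustrative sketch, not the authors' implementation."""

    def __init__(self, num_nodes, max_handoffs=64):
        # threading.Lock may be released by a thread other than the acquirer,
        # which is what lets one cohort member release what another acquired.
        self._global = threading.Lock()
        self._nodes = [_Node() for _ in range(num_nodes)]
        self._max_handoffs = max_handoffs  # assumed bound to keep other nodes from starving

    def acquire(self, node_id):
        node = self._nodes[node_id]
        with node.guard:
            node.waiters += 1
        node.lock.acquire()
        with node.guard:
            node.waiters -= 1
        if not node.owns_global:
            # Nobody on this node passed us the global lock, so compete for it.
            self._global.acquire()
            node.owns_global = True
            node.handoffs = 0

    def release(self, node_id):
        node = self._nodes[node_id]
        with node.guard:
            cohort_waiting = node.waiters > 0
        if cohort_waiting and node.handoffs < self._max_handoffs:
            # Local hand-off: keep the global lock, release only the local one.
            node.handoffs += 1
            node.lock.release()
        else:
            # No local waiter (or bound reached): give other nodes a chance.
            node.owns_global = False
            self._global.release()
            node.lock.release()


def demo(num_nodes=2, threads_per_node=4, iterations=200):
    """Increment a shared counter under the lock and return the final value."""
    lock = CohortLock(num_nodes)
    total = [0]

    def worker(node_id):
        for _ in range(iterations):
            lock.acquire(node_id)
            total[0] = total[0] + 1  # critical section
            lock.release(node_id)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(num_nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Calling `demo()` runs eight threads across two simulated nodes and returns the counter value. The hand-off bound is a design choice in this sketch: without it, a busy node could hold the global lock indefinitely.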

Published In

ACM Transactions on Parallel Computing  Volume 1, Issue 2
Special Issue on PPOPP 2012
January 2015
224 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2737841
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 February 2015
Accepted: 01 September 2014
Revised: 01 July 2014
Received: 01 April 2013
Published in TOPC Volume 1, Issue 2

Author Tags

  1. Concurrency
  2. NUMA
  3. hierarchical locks
  4. locks
  5. multicore
  6. mutex
  7. mutual exclusion
  8. spin locks

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Intel
  • DoE ASCR
  • NSF
  • Oracle
