Abstract
Modern multi-core architectures exhibit non-uniform memory access (NUMA) behavior: a core accesses data cached on its local NUMA node much faster than data cached on a remote node. Prior work has shown that, on NUMA multi-core systems, delegation is a highly efficient remedy for the generally poor performance of locking methods on highly contended locks, because it delegates execution of the critical section to a single thread and thereby reduces the movement of shared data between cores. However, we observe that delegation methods exhibit sub-par single-thread performance, mainly because of the overhead of communication between the server and client threads and of request array traversal. To address this problem, this paper proposes a stealing mechanism: under no contention, lock stealing is enabled and clients enter the critical section directly; under contention, lock stealing is disabled and the delegation protocol is used. We apply the stealing mechanism to two state-of-the-art delegation methods, FFWD and RCL. The evaluation shows that delegation augmented with lock stealing significantly improves performance in the non-contended case while matching the performance of the original methods under contention.
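The switch between stealing and delegation described above can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not reproduce FFWD or RCL: the class name `StealableDelegationLock`, the `stolen` and `delegated` counters, the use of a queue in place of a request array, and the policy for toggling `stealing_enabled` (off after a failed steal, on again once the server finds its queue empty) are all assumptions made for illustration.

```python
import queue
import threading


class StealableDelegationLock:
    """Toy delegation lock with a stealing fast path (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()      # guards every critical section
        self._requests = queue.Queue()     # stand-in for a request array
        self.stealing_enabled = True       # hint: no contention seen yet
        self.stolen = 0                    # critical sections run by clients
        self.delegated = 0                 # critical sections run by the server
        self._server = threading.Thread(target=self._serve, daemon=True)
        self._server.start()

    def execute(self, fn):
        """Run fn() as a critical section and return its result.

        Assumes fn does not raise.
        """
        # Fast path: steal the lock and run the critical section locally.
        if self.stealing_enabled and self._lock.acquire(blocking=False):
            try:
                self.stolen += 1
                return fn()
            finally:
                self._lock.release()
        # Slow path: stealing is disabled or the steal attempt failed.
        # Assumed policy: treat this as contention, keep stealing off,
        # and hand the request to the server thread.
        self.stealing_enabled = False
        result = []
        done = threading.Event()
        self._requests.put((fn, result, done))
        done.wait()
        return result[0]

    def _serve(self):
        """Server loop: execute delegated requests one at a time."""
        while True:
            request = self._requests.get()
            if request is None:            # shutdown sentinel
                return
            fn, result, done = request
            with self._lock:               # excludes concurrent stealers
                self.delegated += 1
                result.append(fn())
            done.set()
            # Assumed policy: an empty queue means contention has subsided.
            if self._requests.empty():
                self.stealing_enabled = True

    def close(self):
        """Stop the server thread."""
        self._requests.put(None)
        self._server.join()
```

With a single client, every call to `execute` takes the fast path, so `stolen` counts all calls and `delegated` stays at zero. In this sketch mutual exclusion comes from `_lock` alone; `stealing_enabled` is only a hint that steers clients toward one path or the other.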
Cite this article
Yi, Z., Yao, Y. A stealing mechanism for delegation methods. J Supercomput 77, 10827–10849 (2021). https://doi.org/10.1007/s11227-021-03719-2
DOI: https://doi.org/10.1007/s11227-021-03719-2