Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

A stealing mechanism for delegation methods

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

Modern multi-core architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a NUMA node is much faster than access to data cached on a remote node. Prior work has shown that on the NUMA multi-core system, delegation is the highly efficient solution to address the generally poor performance of locking methods on highly contended locks due to delegating the execution of critical section to one thread, reducing the movement of shared data between cores. However, we observe that delegation methods exhibit sub-par single-thread performance due mainly to the overheads of the communication between the server and client threads and request array traversal. To address this problem, this paper proposes a stealing mechanism that under no contention the clients can directly enter the critical section by enabling lock stealing. Meanwhile, under contention it employs delegation protocol by disabling lock stealing. Furthermore, we apply stealing mechanism to the state-of-the-art delegation methods: FFWD and RCL. The evaluation shows that delegation augmented with lock stealing can significantly improve the performance in the non-contended case, while matching the performance of their original counterparts in the contended circumstances.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

References

  1. Anderson TE (1990) The performance of spin lock alternatives for shared memory multiprocessors. IEEE Trans Parallel Distrib Syst 1(1):6–16

    Article  Google Scholar 

  2. Baumann A, Barham P, Dagand PE, Tyshasta H, Rebecca I, Simon P, Timothy R, Adrian S, Akhilesh S (2009) The multi-kernel: a new OS architecture for scalable multicore systems. The ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP 09), pp. 29-44

  3. Baumann A, Peter S, Adrian S, Akhilesh S, Timothy R, Paul B, Rebecca I (2009) Your computer is already a distributed system. Why Isn’t Your OS? Workshop on Workstation Operating Systems /workshop on Hot Topics in Operating Systems (HotOS 09), pp. 12-12

  4. Bienia C (2011) Benchmarking modern multiprocessors. Ph.D. Dissertation. Princeton University, Princeton, NJ

  5. Boyd-Wickizer S, Clements AT, Mao Y, Pesterev A, Kaashoek MF, Morris RT, Zeldovich N (2010) An analysis of Linux scalability to many cores. In: Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Vancouver, Canada, pp. 1-16

  6. Boyd-Wickizer S, Kaashoek MF, Morris R, Zeldovich N (2012) Non-scalable locks are dangerous. In: Proceedings of the Linux Symposium. Ottawa, Canada

  7. Chabbi M, Fagan M, Mellor-Crummey J (2015) High performance locks for multi-level NUMA systems. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 215-226

  8. Chabbi M, Mellor-Crummey J (2016) Contention-conscious, locality-preserving locks. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

  9. Craig T (1993) Building FIFO and priority queuing spin locks from atomic swap, Univ. Washington, Seattle, WA, USA, Tech. Rep. 93-02-02

  10. Dave Dice, Virendra J. Marathe, and Nir Shavit. Flat-combining NUMA locks. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 65-74, 2011

  11. David T, Guerraoui R, Trigonakis V (2013) Everything you always wanted to know about synchronization but were afraid to ask. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles(SOSP 13), pp. 33-48

  12. Dice D (2017) Malthusian locks. In: Proceedings of the Twelfth European Conference on Computer Systems (EuroSys), pp. 314-327

  13. Dice D, Marathe VJ, Shavit N (2015) Lock cohorting: a general technique for designing NUMA locks. ACM Trans Parallel Comput 1(2):13:1-13:42

    Article  Google Scholar 

  14. Dice D, Kogan A (2019) Compact NUMA-aware Locks. In: Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys’19). ACM, New York, NY, USA, Article 12, 15 pages

  15. Dogan H, Hijaz F, Ahmad M, Kahne B, Wilson P, Khan O (2017). Accelerating graph and machine learning workloads using a shared memory multicore architecture with auxiliary support for in-hardware explicit messaging. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, pp. 254-264

  16. Eyerman S, Eeckhout L (2010) Modeling critical sections in Amdahl’s law and its implications for multicore design. In: Proceedings of the Annual International Symposium on Computer Architecture (ISCA), pp. 362-370

  17. Fatourou P, Kallimanis ND (2012) Revisiting the combining synchronization technique. In: Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), pp. 257-266, New Orleans, LA

  18. Guerraoui R, Guiroux H, Lachaize R, Quma V, Trigonakis V (2019) Lock-unlock: Is that all? a pragmatic analysis of locking in software systems. ACM Trans Comput Syst 36(1):1:1-1:149

    Article  Google Scholar 

  19. Guerraoui R, Trigonakis V (2016) Optimistic concurrency with OPTIK. In: Proceedings of the 21st ACM Symposium on Principles and Practice of Parallel Programming (PPoPP). ACM, Barcelona, Spain, pp. 18:1-18:12

  20. Guiroux H, Lachaize R, Quéma V (2016) Multicore locks: the case is not closed yet. In: Proceedings of the USENIX Annual Technical Conference (ATC), pp. 649-662

  21. Hackenberg D, Molka D, Nagel W (2009) Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. ACM International Symposium on Microarchitecture(MICRO 09), pp. 413- 422

  22. Hendler D, Incze I, Shavit N, Tzafrir M (2010) Flat combining and the synchronization-parallelism tradeoff. In: Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures, pp. 355-364. ACM

  23. He B, Scherer WN, Scott ML (2005) Preemption Adaptivity in Time-published Queue-based Spin Locks. In: Proceedings of the 12th International Conference on High Performance Computing (HiPC’05) (2005), Springer-Verlag

  24. Kashyap S, Calciu I, Cheng X, Min C, Kim T (2019) Scalable and practical locking with shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pp. 586-599

  25. Kashyap S, Min C, Kim T (2017) Scalable NUMA-aware Blocking Synchronization Primitives. In: Proceedings of the (2017) USENIX Annual Technical Conference (ATC). USENIX Association, Santa Clara, CA

  26. Kim J, Mathew A, Kashyap S, Ramanathan MK, Min C (2019) MV-RLU: scaling read-log-update with multi-versioning. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 779-792

  27. Klaftenegger D, Sagonas K, Winblad K (2018) Queue delegation locking. IEEE Trans Parallel Distrib Syst 29(3):687–704

    Article  Google Scholar 

  28. Liu T, Berger ED (2011) Sheriff: precise detection and automatic mitigation of false sharing. ACM Sigplan Notices 46:3–18

    Article  Google Scholar 

  29. Lozi JP, David F, Thomas G, Lawall J, Muller G (2016) Fast and portable locking for multicore architectures. ACM Trans. Comput. Syst., 33(4):13:1-13:62

  30. Lozi JP , David F, Thomas G, Lawall JL, Muller G et al. (2012) Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In: USENIX Annual Technical Conference, pp. 65-76

  31. Magnusson PS, Landin A, Hagersten E (1994) Queue locks on cache coherent multiprocessors. In: Proceedings of the 8th International Parallel Processing Symposium, pages pp. 165-171

  32. Mellor-Crummey JM, Scott ML (1991) Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans Comput Syst. https://doi.org/10.1145/103727.103729

    Article  Google Scholar 

  33. Memcached, . https://memcached.org

  34. Memslap, . http://docs.libmemcached.org/bin/ memaslap.html

  35. Nanavati M, Spear M, Taylor N, Rajagopalan S, Meyer DT, Aiello W, Warfield A (2013) Whose cache line is it anyway?: operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, pp. 141-154. ACM, 2013

  36. Oyama Y, Taura K, Yonezawa A (1999) Executing parallel programs with synchronization bottlenecks efficiently, In: Proc. Int. Workshop Parallel Distrib. Comput. Symbolic Irregular Appl., pp. 182-204

  37. Radovic Z, Hagersten E (2003) Hierarchical backoff locks for nonuniform communication architectures. In International Symposium on High- Performance Computer Architecture (HPCA), pp. 241-252

  38. Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C (2007) Evaluating mapreduce for multi-core and multiprocessor systems. In High Performance Computer Architecture. HPCA (2007) IEEE 13th International Symposium on, pp. 13-24. Ieee, 2007

  39. Roghanchi S, Eriksson J, Basu N (2017) ffwd: delegation is (much) faster than you think. In: Proceedings of the 26th Symposium on Operating Systems Principles. ACM, pp. 342-358

  40. Splash-2. http://www.capsl.udel.edu/splash

  41. Tang X , Zhai J, Qian X, Chen W (2019) pLock: a fast lock for architectures with explicit inter-core message passing. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 765-778

  42. Victor Luchangco, Dan Nussbaum, and Nir Shavit. A hierarchical CLH queue lock. In Proceedings of the 12th International Conference on Parallel Processing (EuroPar), pages 801-810, 2006

  43. Zhang M, Chen H, Cheng L, Lau FCM, Wang CL (2017) Scalable Adaptive NUMA-Aware Lock. IEEE Trans Parallel Distrib Syst 28(6):1754–1769

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zhengming Yi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yi, Z., Yao, Y. A stealing mechanism for delegation methods. J Supercomput 77, 10827–10849 (2021). https://doi.org/10.1007/s11227-021-03719-2

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-021-03719-2

Keywords