DOI: 10.1145/3297858.3304030

pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing

Published: 04 April 2019

    Abstract

    Synchronization is a significant issue for multi-threaded programs. The mutex lock, a classic solution, is widely used in legacy programs and remains popular because it is intuitive. The SW26010 architecture, deployed on the Sunway TaihuLight supercomputer, introduces a hardware-supported inter-core message passing mechanism and exposes explicit interfaces that let developers use its fast on-chip network. This emerging architectural feature brings both opportunities and challenges for mutex lock implementation, yet there is still no general lock mechanism optimized for architectures with this feature. In this paper, we propose pLock, a fast lock designed for architectures that support Explicit inter-core Message Passing (EMP). pLock dedicates a subset of the cores as lock servers and leverages the fast on-chip network to implement high-performance mutual exclusion locks. We propose two new techniques -- chaining lock and hierarchical lock -- to reduce message count and mitigate network congestion. We implement and evaluate pLock on an SW26010 processor. The experimental results show that our proposed techniques improve the performance of EMP-lock by up to 19.4x over a basic design.
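
    The lock-server idea in the abstract can be illustrated with a minimal sketch. This is not the pLock implementation and it does not model the chaining or hierarchical techniques. It assumes Python threads stand in for cores and `queue.Queue` objects stand in for the on-chip message channels. All names (`lock_server`, `client`, `run_demo`, the message tags) are hypothetical and chosen only for illustration.

```python
import queue
import threading
from collections import deque

# Hypothetical message tags; real EMP hardware would use its own message format.
ACQUIRE, RELEASE, STOP = "acquire", "release", "stop"


def lock_server(inbox, mailboxes):
    """Own the lock state and grant the lock to one client at a time."""
    holder = None      # id of the client currently holding the lock
    waiting = deque()  # FIFO of clients waiting for the lock
    while True:
        kind, cid = inbox.get()
        if kind == STOP:
            break
        if kind == ACQUIRE:
            if holder is None:
                holder = cid
                mailboxes[cid].put("grant")
            else:
                waiting.append(cid)
        elif kind == RELEASE:
            # Hand the lock to the next waiter, if there is one.
            holder = waiting.popleft() if waiting else None
            if holder is not None:
                mailboxes[holder].put("grant")


def client(cid, inbox, mailbox, shared, iterations):
    """Acquire the lock by message, update shared state, then release."""
    for _ in range(iterations):
        inbox.put((ACQUIRE, cid))
        mailbox.get()            # block until the server sends a grant
        shared["counter"] += 1   # critical section
        inbox.put((RELEASE, cid))


def run_demo(num_clients=4, iterations=1000):
    """Run the server and clients; return the final counter value."""
    inbox = queue.Queue()
    mailboxes = [queue.Queue() for _ in range(num_clients)]
    shared = {"counter": 0}
    server = threading.Thread(target=lock_server, args=(inbox, mailboxes))
    server.start()
    workers = [
        threading.Thread(target=client,
                         args=(i, inbox, mailboxes[i], shared, iterations))
        for i in range(num_clients)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    inbox.put((STOP, None))
    server.join()
    return shared["counter"]


if __name__ == "__main__":
    print(run_demo())
```

    In this sketch every critical section costs three messages (acquire, grant, release), and all of them pass through the single server inbox. That gives a rough sense of why message count and congestion matter for a server-based lock.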

    Published In

    ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
    April 2019
    1126 pages
    ISBN: 9781450362405
    DOI: 10.1145/3297858
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. inter-core message passing
    2. lock
    3. on-chip network
    4. synchronization

    Qualifiers

    • Research-article

    Conference

    ASPLOS '19

    Acceptance Rates

    ASPLOS '19 Paper Acceptance Rate: 74 of 351 submissions, 21%
    Overall Acceptance Rate: 535 of 2,713 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 133
    • Downloads (Last 6 weeks): 11
    Reflects downloads up to 10 Aug 2024

    Cited By

    • (2023) Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (1-13). DOI: 10.1145/3572848.3577474. Online publication date: 25-Feb-2023
    • (2023) DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (302-316). DOI: 10.1109/HPCA56546.2023.10071005. Online publication date: Feb-2023
    • (2022) High performance GPU concurrent B+tree. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (443-444). DOI: 10.1145/3503221.3508419. Online publication date: 2-Apr-2022
    • (2021) FTSD. Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems (123-130). DOI: 10.1145/3476886.3477518. Online publication date: 24-Aug-2021
    • (2021) SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (263-276). DOI: 10.1109/HPCA51647.2021.00031. Online publication date: Feb-2021
    • (2021) A stealing mechanism for delegation methods. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03719-2. Online publication date: 12-Mar-2021
    • (2020) Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer. IEEE Transactions on Parallel and Distributed Systems 31:5 (1194-1208). DOI: 10.1109/TPDS.2019.2962395. Online publication date: 1-May-2020
    • (2020) A Fast Lock for Explicit Message Passing Architectures. IEEE Transactions on Computers (1-1). DOI: 10.1109/TC.2020.3015727. Online publication date: 2020
    • (2020) A scalable lock on NUMA multicore. Concurrency and Computation: Practice and Experience 32:24. DOI: 10.1002/cpe.5964. Online publication date: 14-Aug-2020
    • (2019) Paths to Fast Barrier Synchronization on the Node. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (109-120). DOI: 10.1145/3307681.3325402. Online publication date: 17-Jun-2019
