pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing

Authors: Xiongchao Tang, Wenguang Chen

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 765 - 778

https://doi.org/10.1145/3297858.3304030

Published: 04 April 2019

Abstract

Synchronization is a significant issue for multi-threaded programs. The mutex lock, a classic solution, is widely used in legacy programs and remains popular because it is intuitive. The SW26010 architecture, deployed on the Sunway TaihuLight supercomputer, introduces a hardware-supported inter-core message passing mechanism and exposes explicit interfaces that let developers use its fast on-chip network. This emerging architectural feature brings both opportunities and challenges for mutex lock implementation, yet there is still no general lock mechanism optimized for architectures with this feature. In this paper, we propose pLock, a fast lock designed for architectures that support Explicit inter-core Message Passing (EMP). pLock uses a subset of the cores as lock servers and leverages the fast on-chip network to implement high-performance mutual exclusion locks. We propose two new techniques, chaining lock and hierarchical lock, to reduce message count and mitigate network congestion. We implement and evaluate pLock on an SW26010 processor. The experimental results show that our proposed techniques improve the performance of EMP-lock by up to 19.4x over a basic design.
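
The lock-server idea in the abstract can be illustrated with a minimal sketch. This is not pLock itself and does not use the SW26010 messaging interfaces: it assumes Python threads stand in for cores and `queue.Queue` objects stand in for on-chip message channels. All names here (`lock_server`, `run_demo`, the message tags) are hypothetical and chosen for illustration. The sketch corresponds only to a basic message-passing lock; the chaining and hierarchical techniques are not modeled.

```python
import queue
import threading

# Hypothetical message tags; the real EMP interface is not shown in the source.
ACQUIRE, RELEASE, STOP = "acquire", "release", "stop"


def lock_server(inbox, client_inboxes):
    """Serve one mutex: grant it to one client at a time, in request order."""
    waiting = []   # ids of clients waiting for the lock
    holder = None  # id of the client holding the lock, or None if free
    while True:
        kind, client = inbox.get()
        if kind == STOP:
            return
        if kind == ACQUIRE:
            if holder is None:
                holder = client
                client_inboxes[client].put("granted")
            else:
                waiting.append(client)
        elif kind == RELEASE:
            # Hand the lock to the oldest waiter, if there is one.
            holder = waiting.pop(0) if waiting else None
            if holder is not None:
                client_inboxes[holder].put("granted")


def run_demo(num_clients=4, iterations=200):
    """Increment a shared counter under the message-based lock."""
    server_inbox = queue.Queue()
    client_inboxes = [queue.Queue() for _ in range(num_clients)]
    shared = {"counter": 0}

    def client(cid):
        for _ in range(iterations):
            server_inbox.put((ACQUIRE, cid))  # request message
            client_inboxes[cid].get()         # block until the grant arrives
            value = shared["counter"]         # critical section
            shared["counter"] = value + 1
            server_inbox.put((RELEASE, cid))  # release message

    server = threading.Thread(target=lock_server,
                              args=(server_inbox, client_inboxes))
    server.start()
    workers = [threading.Thread(target=client, args=(i,))
               for i in range(num_clients)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    server_inbox.put((STOP, None))
    server.join()
    return shared["counter"]
```

In this sketch every acquire and release costs one message to the server plus one grant message back, so all traffic converges on a single endpoint. That per-operation message cost and the resulting congestion are what the abstract says the chaining and hierarchical techniques aim to reduce.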

Index Terms

pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
  2. Dependable and fault-tolerant systems and networks
    1. Processors and memory architectures
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Concurrency control
        Multithreading
        Mutual exclusion

Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems

April 2019

1126 pages

ISBN:9781450362405

DOI:10.1145/3297858

General Chairs: Iris Bahar (Brown University), Maurice Herlihy (Brown University)

Program Chairs: Emmett Witchel (University of Texas, Austin), Alvin Lebeck (Duke University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
National Science Foundation
National Key R&D Program of China

Conference

ASPLOS '19: Architectural Support for Programming Languages and Operating Systems

April 13 - 17, 2019

Providence, RI, USA

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%
