DOI: 10.1145/2903150.2903155
research-article

Lock-based synchronization for GPU architectures

Published: 16 May 2016

Abstract

Modern GPUs have shown promising results in accelerating compute-intensive and numerical workloads with limited data sharing. However, emerging GPU applications exhibit a large amount of data sharing among concurrently executing threads. Data sharing often requires a mutual exclusion mechanism to ensure data integrity in a multithreaded environment. Although modern GPUs provide atomic primitives that can be leveraged to construct fine-grained locks, existing GPU lock implementations either incur frequent concurrency bugs or lead to extremely low hardware utilization due to the Single Instruction Multiple Threads (SIMT) execution paradigm of GPUs.
To let more applications with data sharing benefit from GPU acceleration, we propose a new locking scheme for GPU architectures. The proposed locking scheme allows lock stealing within individual warps to avoid the concurrency bugs caused by the SIMT execution of GPUs. Moreover, it adopts lock virtualization to reduce the memory cost of fine-grained GPU locks. To illustrate the usage and benefit of GPU locks, we apply the proposed locking scheme to Delaunay mesh refinement (DMR), an application involving massive data sharing among threads. Our lock-based implementation achieves a 1.22x speedup over an implementation based on algorithmic optimization (which uses a synchronization mechanism tailored for DMR), with 94% less memory cost.
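
The abstract does not spell out how lock virtualization is implemented, so the following is only a rough CPU-side illustration of the general idea in C++, not the paper's GPU scheme. It assumes that many logical locks (for example, one per mesh element) are mapped onto a much smaller pool of physical lock words by a simple modulo hash. The class name `VirtualLockTable`, its methods, and the modulo mapping are all hypothetical choices made for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of lock virtualization: a small pool of physical
// spin-lock words backs a much larger space of logical lock ids.
// The modulo mapping is an assumption, not taken from the paper.
class VirtualLockTable {
public:
    explicit VirtualLockTable(std::size_t physical_count)
        : slots_(physical_count) {
        assert(physical_count > 0);
        // 0 means "free", 1 means "held".
        for (auto& slot : slots_) slot.store(0, std::memory_order_relaxed);
    }

    // Map a logical lock id (e.g. an element index) to a physical slot.
    std::size_t slot_of(std::size_t logical_id) const {
        return logical_id % slots_.size();
    }

    // Attempt once to acquire the physical lock behind logical_id.
    // Returns true on success; the caller decides whether to retry.
    bool try_lock(std::size_t logical_id) {
        int expected = 0;
        return slots_[slot_of(logical_id)].compare_exchange_strong(
            expected, 1,
            std::memory_order_acquire, std::memory_order_relaxed);
    }

    // Release the physical lock behind logical_id.
    void unlock(std::size_t logical_id) {
        slots_[slot_of(logical_id)].store(0, std::memory_order_release);
    }

    std::size_t physical_count() const { return slots_.size(); }

private:
    std::vector<std::atomic<int>> slots_;
};
```

With a table of 4 physical slots, logical ids 1 and 5 share slot 1, so holding lock 1 makes `try_lock(5)` fail until lock 1 is released. This shows the trade-off in such a design: memory drops from one lock word per data element to one per slot, at the price of occasional false conflicts between unrelated elements.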

Published In

CF '16: Proceedings of the ACM International Conference on Computing Frontiers
May 2016
487 pages
ISBN:9781450341288
DOI:10.1145/2903150
  • General Chairs: Gianluca Palermo, John Feo
  • Program Chairs: Antonino Tumeo, Hubertus Franke
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. SIMT
  3. lock-based synchronization

Qualifiers

  • Research-article

Conference

CF'16: Computing Frontiers Conference
May 16 - 19, 2016
Como, Italy

Acceptance Rates

CF '16 Paper Acceptance Rate: 30 of 94 submissions, 32%
Overall Acceptance Rate: 273 of 785 submissions, 35%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 36
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 18 Aug 2024

Cited By

  • (2024) A Framework for Fine-Grained Synchronization of Dependent GPU Kernels. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 93-105. DOI: 10.1109/CGO57630.2024.10444873. Online publication date: 2-Mar-2024
  • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access 12, pp. 34354-34377. DOI: 10.1109/ACCESS.2024.3372990. Online publication date: 2024
  • (2022) Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs. ACM Transactions on Architecture and Code Optimization 19:4, pp. 1-21. DOI: 10.1145/3547301. Online publication date: 16-Sep-2022
  • (2021) KVCG. Proceedings of the 14th ACM International Conference on Systems and Storage, pp. 1-12. DOI: 10.1145/3456727.3463779. Online publication date: 14-Jun-2021
  • (2021) DACHash: A Dynamic, Cache-Aware and Concurrent Hash Table on GPUs. 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 1-10. DOI: 10.1109/SBAC-PAD53543.2021.00012. Online publication date: Oct-2021
  • (2021) sRSP: An efficient and scalable implementation of remote scope promotion. Concurrency and Computation: Practice and Experience 34:9. DOI: 10.1002/cpe.6483. Online publication date: 11-Jul-2021
  • (2020) Thread-Level Locking for SIMT Architectures. IEEE Transactions on Parallel and Distributed Systems 31:5, pp. 1121-1136. DOI: 10.1109/TPDS.2019.2955705. Online publication date: 1-May-2020
  • (2020) Don't forget about synchronization! Guidelines for using locks on graphics processing units. Concurrency and Computation: Practice and Experience 34:2. DOI: 10.1002/cpe.5757. Online publication date: 13-Apr-2020
  • (2019) Don't Forget About Synchronization! Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 11-20. DOI: 10.1145/3303084.3309488. Online publication date: 17-Feb-2019
  • (2019) Fast Fine-Grained Global Synchronization on GPUs. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 793-806. DOI: 10.1145/3297858.3304055. Online publication date: 4-Apr-2019
