research-article

Lock-based synchronization for GPU architectures

Authors:

Depei Qian

CF '16: Proceedings of the ACM International Conference on Computing Frontiers

Pages 205 - 213

https://doi.org/10.1145/2903150.2903155

Published: 16 May 2016

Abstract

Modern GPUs have shown promising results in accelerating compute-intensive and numerical workloads with limited data sharing. However, emerging GPU applications exhibit a large amount of data sharing among concurrently executing threads. Data sharing often requires a mutual exclusion mechanism to ensure data integrity in a multithreaded environment. Although modern GPUs provide atomic primitives that can be leveraged to construct fine-grained locks, existing GPU lock implementations either incur frequent concurrency bugs or lead to extremely low hardware utilization due to the Single Instruction Multiple Threads (SIMT) execution paradigm of GPUs.

To let more applications with data sharing benefit from GPU acceleration, we propose a new locking scheme for GPU architectures. The proposed scheme allows lock stealing within individual warps to avoid the concurrency bugs caused by the SIMT execution of GPUs. Moreover, it adopts lock virtualization to reduce the memory cost of fine-grained GPU locks. To illustrate the usage and the benefit of GPU locks, we apply the proposed locking scheme to Delaunay mesh refinement (DMR), an application involving massive data sharing among threads. Our lock-based implementation can achieve a 1.22x speedup over an implementation based on algorithmic optimization (which uses a synchronization mechanism tailored for DMR), at 94% lower memory cost.
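
The memory-saving idea behind lock virtualization can be illustrated with a minimal host-side sketch. This is not the paper's implementation: it assumes that lock virtualization means mapping many logical per-item locks onto a smaller pool of physical lock words, and the names `VirtualLockTable`, `slot_of`, and `run_counter_demo`, as well as the modulo mapping, are hypothetical. It is written in standard C++ with `std::atomic` rather than GPU atomics, so it does not model warp-level behavior or lock stealing.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: a fixed pool of physical locks shared by many items.
class VirtualLockTable {
public:
    explicit VirtualLockTable(std::size_t slots) : locks_(slots) {
        for (auto& l : locks_) l.store(0, std::memory_order_relaxed);
    }

    // Map a logical item id to a physical slot (modulo mapping is an assumption).
    std::size_t slot_of(std::size_t item) const { return item % locks_.size(); }

    // One compare-and-swap attempt; returns true if the lock was acquired.
    bool try_lock(std::size_t item) {
        int expected = 0;
        return locks_[slot_of(item)].compare_exchange_strong(
            expected, 1, std::memory_order_acquire, std::memory_order_relaxed);
    }

    // Blocking acquire for CPU threads only.
    void lock(std::size_t item) {
        while (!try_lock(item)) std::this_thread::yield();
    }

    void unlock(std::size_t item) {
        locks_[slot_of(item)].store(0, std::memory_order_release);
    }

private:
    std::vector<std::atomic<int>> locks_;  // physical lock words: 0 free, 1 held
};

// Several threads increment per-item counters, each update guarded by the
// physical lock that the item maps to. Returns the sum of all counters.
long run_counter_demo(std::size_t items, std::size_t slots,
                      int threads, int iters) {
    VirtualLockTable table(slots);
    std::vector<long> counters(items, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = 0; i < iters; ++i) {
                std::size_t item = (static_cast<std::size_t>(t) + i) % items;
                table.lock(item);
                ++counters[item];
                table.unlock(item);
            }
        });
    }
    for (auto& th : pool) th.join();
    long total = 0;
    for (long c : counters) total += c;
    return total;
}
```

With 64 items and 8 slots, the table stores 8 lock words instead of 64, at the price of false conflicts between items that share a slot. A call such as `run_counter_demo(64, 8, 4, 1000)` performs 4 x 1000 guarded increments, so the returned total is 4000. The blocking `lock()` loop is acceptable for CPU threads but is the kind of spin-wait that tends to misbehave when threads of one warp execute in lockstep, which is the problem the abstract says the proposed scheme addresses.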

Index Terms

Lock-based synchronization for GPU architectures
1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Concurrent programming structures
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Shared memory algorithms

Published In

CF '16: Proceedings of the ACM International Conference on Computing Frontiers

May 2016

487 pages

ISBN:9781450341288

DOI:10.1145/2903150

General Chairs: Gianluca Palermo (Politecnico di Milano, IT), John Feo (Pacific Northwest National Laboratory and Northwest Institute for Advanced Computing)

Program Chairs: Antonino Tumeo (Pacific Northwest National Laboratory, USA), Hubertus Franke (New York University and IBM Research, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Micron Foundation: Micron Technology Foundation, Inc.
ACM: Association for Computing Machinery
Politecnico di Milano: Politecnico di Milano
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IBM: IBM

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CF'16

CF'16: Computing Frontiers Conference

May 16 - 19, 2016

Como, Italy

Acceptance Rates

CF '16 Paper Acceptance Rate 30 of 94 submissions, 32%;

Overall Acceptance Rate 273 of 785 submissions, 35%
