tutorial

Software Transactional Memory for GPU Architectures

Authors:

Nilanjan Goswami,

Depei Qian

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 1 - 10

https://doi.org/10.1145/2581122.2544139

Published: 15 February 2014

Abstract

Modern GPUs have shown promising results in accelerating computation-intensive and numerical workloads with limited dynamic data sharing. However, many real-world applications exhibit a large amount of data sharing among concurrently executing threads. Data sharing often requires a mutual exclusion mechanism to ensure data integrity in a multithreaded environment. Although modern GPUs provide atomic primitives that can be leveraged to construct fine-grained locks, lock-based synchronization requires significant programming effort to achieve functional correctness. The massive multithreading and SIMT execution paradigm of GPUs further compound the challenges of GPU locks.

To let applications with dynamic data sharing benefit from GPU acceleration, we propose a novel software transactional memory system for GPU architectures (GPU-STM). The major challenges are ensuring good scalability with respect to the massive multithreading of GPUs and preventing livelocks caused by the SIMT execution paradigm of GPUs. To address these two challenges, we propose (1) a hierarchical validation technique and (2) an encounter-time lock-sorting mechanism, respectively. We build our GPU-STM prototype on a commercially available GPU platform and runtime. Our evaluation on a real system shows that GPU-STM outperforms coarse-grain locks on GPUs by up to 20x.
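The abstract does not spell out how encounter-time lock-sorting works, so the following is only a host-side C++ sketch of one plausible reading: each transaction inserts a lock into a sorted local list when it first encounters it, and later acquires all locks in that single global order. Everything named here (`lock_table`, `Tx`, `encounter`, `acquire_all`, `release_all`) is hypothetical and is not taken from GPU-STM; the locks are built from an atomic exchange in the same spirit as locks built from GPU atomic primitives.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical lock table (an assumption, not GPU-STM's layout):
// one atomic word per lock, 0 = free, 1 = held.
constexpr std::size_t kNumLocks = 8;
std::atomic<int> lock_table[kNumLocks];  // static storage, starts at 0

// Try to take lock i with a single atomic exchange; true on success.
bool try_lock(std::size_t i) {
    return lock_table[i].exchange(1, std::memory_order_acquire) == 0;
}

void unlock(std::size_t i) {
    lock_table[i].store(0, std::memory_order_release);
}

// A toy transaction descriptor that only tracks which locks it needs.
struct Tx {
    std::vector<std::size_t> locks;  // always sorted ascending, no duplicates

    // Called when the transaction first touches data guarded by lock_id.
    // The lock is placed at its sorted position immediately, so no
    // separate sorting pass is needed before acquisition.
    void encounter(std::size_t lock_id) {
        auto it = std::lower_bound(locks.begin(), locks.end(), lock_id);
        if (it == locks.end() || *it != lock_id) {
            locks.insert(it, lock_id);
        }
    }

    // Acquire every recorded lock in ascending order. Because all
    // transactions use the same order, two of them can never each hold
    // a lock that the other is waiting for.
    void acquire_all() {
        for (std::size_t id : locks) {
            while (!try_lock(id)) {
                std::this_thread::yield();
            }
        }
    }

    // Release in reverse order of acquisition.
    void release_all() {
        for (auto it = locks.rbegin(); it != locks.rend(); ++it) {
            unlock(*it);
        }
    }
};
```

In this sketch, a transaction that encounters locks 5 then 2 and another that encounters 2 then 5 both end up with the list {2, 5}, so they contend for lock 2 first rather than each grabbing one lock and retrying against the other. On a real GPU the same idea would have to be expressed with device atomics and would interact with warp-level lockstep execution, which this host-side sketch does not model.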

Index Terms

Computing methodologies → Parallel computing methodologies → Parallel programming languages

Software and its engineering → Software notations and tools → General programming languages → Language types → Parallel programming languages

Published In

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 2014

328 pages

ISBN: 9781450326704

DOI: 10.1145/2581122

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

CGO '14

CGO '14: 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 15 - 19, 2014

Orlando, FL, USA

Acceptance Rates

CGO '14 Paper Acceptance Rate 29 of 100 submissions, 29%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%
