research-article

GPU Concurrency: Weak Behaviours and Programming Assumptions

Authors:

Jade Alglave,

Mark Batty,

Alastair F. Donaldson,

Ganesh Gopalakrishnan,

Jeroen Ketema,

Daniel Poetzl,

Tyler Sorensen,

John WickersonAuthors Info & Claims

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 577 - 591

https://doi.org/10.1145/2694344.2694391

Published: 14 March 2015 Publication History

Get Access

Abstract

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software.

To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming---often supported by official tutorials---as false.

As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.

References

[1]

Online companion material. http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/.

Google Scholar

[2]

GPUBench, June 2014. http://graphics.stanford.edu/projects/gpubench.

Google Scholar

[3]

J. Alglave. A Shared Memory Poetics. PhD thesis, Université Paris 7, 2010.

Google Scholar

[4]

J. Alglave. A formal hierarchy of weak memory models. Formal Methods in System Design (FMSD), 41(2):178--210, 2012.

Digital Library

Google Scholar

[5]

J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Litmus: Running tests against hardware. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 41--44, 2011.

Digital Library

Google Scholar

[6]

J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Fences in weak memory models (extended version). Formal Methods in System Design (FMSD), 40(2):170--205, 2012.

Digital Library

Google Scholar

[7]

J. Alglave, L. Maranget, and M. Tautschnig. Herding cats: Modelling, simulation, testing, and data mining for weak memory. ACM Transactions on Programming Languages and Systems (TOPLAS), 36(2):7, 2014.

Digital Library

Google Scholar

[8]

AMD. AMD intermediate language (IL), version 2.4, Oct. 2011.

Google Scholar

[9]

AMD. Evergreen family instruction set architecture: Instructions and microcode, revision 1.1a, Nov. 2011.

Google Scholar

[10]

AMD. Southern Islands series instruction set architecture, revision 1.1, Dec. 2012.

Google Scholar

[11]

AMD. AMD accelerated parallel processing OpenCL programming guide, Nov. 2013.

Google Scholar

[12]

ARM. Cortex-A9 MPCore, programmer advice notice, read-after-read hazards ARM reference 761319, Sept. 2011.

Google Scholar

[13]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 119--129, 1998.

Digital Library

Google Scholar

[14]

H. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In Programming Language Design and Implementation (PLDI), pages 68--78, 2008.

Digital Library

Google Scholar

[15]

D. Cederman and P. Tsigas. On dynamic load balancing on graphics processors. In SIGGRAPH/Eurographics, pages 57--64, 2008.

Digital Library

Google Scholar

[16]

D. Cederman and P. Tsigas. Dynamic load balancing on graphics processors, Feb. 2014. http://www.cse.chalmers.se/research/group/dcs/gpuloadbal.html.

Digital Library

Google Scholar

[17]

W. Collier. Reasoning About Parallel Architectures. Prentice-Hall, 1992.

Digital Library

Google Scholar

[18]

E. Eide and J. Regehr. Volatiles are miscompiled, and what to do about it. In Embedded software (EMSOFT), pages 255--264, 2008.

Digital Library

Google Scholar

[19]

W. Feng and S. Xiao. To GPU synchronize or not GPU synchronize? In International Symposium on Circuits and Systems (ISCAS), pages 3801--3804, 2010.

Google Scholar

[20]

A. Habermaier and A. Knapp. On the correctness of the SIMT execution model of GPUs. In European Symposium on Programming (ESOP), pages 316--335, 2012.

Digital Library

Google Scholar

[21]

T. Härder and A. Reuter. Principles of transaction-oriented database recovery. Computing Survey (CSUR), 15(4):287--317, 1983.

Digital Library

Google Scholar

[22]

B. He and J. X. Yu. High-throughput transaction executions on graphics processors. The Proceedings of the VLDB Endowment (PVLDB), 4(5):314--325, 2011.

Digital Library

Google Scholar

[23]

B. A. Hechtman and D. J. Sorin. Exploring memory consistency for massively-threaded throughput-oriented processors. In International Symposium on Circuits and Systems (ISCA), pages 201--212, 2013.

Digital Library

Google Scholar

[24]

D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Sequential consistency for heterogeneous-race-free. In Memory Systems Performance and Correctness (MSPC), 2013.

Google Scholar

[25]

D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous- race-free memory models. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 427--440, 2014.

Digital Library

Google Scholar

[26]

W.-m. W. Hwu. GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., 2011.

Digital Library

Google Scholar

[27]

Khronos OpenCL Working Group. The OpenCL specification (version 1.2, revision 19), Nov. 2012.

Google Scholar

[28]

L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690--691, 1979.

Digital Library

Google Scholar

[29]

S. Lee, Y. Kim, J. Kim, and J. Kim. Stealing webpages rendered on your browser by exploiting GPU vulnerabilities. In Symposium on Security and Privacy (SP), pages 19--33, 2014.

Digital Library

Google Scholar

[30]

P. Misra and M. Chaudhuri. Performance evaluation of concurrent lock-free data structures on GPUs. In International Conference on Parallel and Distributed Systems, (ICPADS), pages 53--60, 2012.

Digital Library

Google Scholar

[31]

R. Morisset, P. Pawan, and F. Z. Nardelli. Compiler testing via a theory of sound optimisations in the C11/C++11 memory model. In Programming Language Design and Implementation (PLDI), pages 187--196, 2013.

Digital Library

Google Scholar

[32]

Nvidia. CUDA C programming guide, version 5.5, July 2013.

Google Scholar

[33]

Nvidia. CUDA by example -- errata, June 2014. http://developer.nvidia.com/cuda-example-errata-page.

Google Scholar

[34]

Nvidia. CUDA C programming guide, version 6, July 2014.

Google Scholar

[35]

Nvidia. CUDA binary utilities, Aug. 2014. http://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf.

Google Scholar

[36]

Nvidia. Parallel thread execution ISA: Version 4.0, Feb. 2014.

Google Scholar

[37]

S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-tso. In Theorem Proving in Higher Order Logics (TPHOLs), pages 391--407, 2009.

Digital Library

Google Scholar

[38]

J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.

Digital Library

Google Scholar

[39]

S. Sarkar, P. Sewell, J. Alglave, L. Maranget, and D. Williams. Understanding POWER multiprocessors. In Programming Language Design and Implementation (PLDI), pages 175--186, 2011.

Digital Library

Google Scholar

[40]

T. Sorensen. Towards shared memory consistency models for GPUs. Bachelor's thesis, University of Utah, 2013.

Google Scholar

[41]

T. Sorensen, G. Gopalakrishnan, and V. Grover. Towards shared memory consistency models for GPUs. In International Conference on Supercomputing (ICS), pages 489--490, 2013.

Digital Library

Google Scholar

[42]

J. A. Stuart and J. D. Owens. Efficient synchronization primitives for GPUs. Computing Research Repository (CoRR), abs/1110.4623, 2011. http://arxiv.org/pdf/1110.4623.pdf.

Google Scholar

[43]

D. L. Weaver and T. Germond. The SPARC Architecture Manual Version 9. SPARC International Inc., 1994.

Google Scholar

[44]

H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In Performance Analysis of Systems Software (ISPASS), pages 235--246, 2010.

Crossref

Google Scholar

[45]

S. Xiao and W. Feng. Inter-block GPU communication via fast barrier synchronization. In International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--12, 2010.

Google Scholar

Cited By

View all

Geeson LSmith L(2024)Compiler Testing with Relaxed Memory Models2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444836(334-348)Online publication date: 2-Mar-2024
https://doi.org/10.1109/CGO57630.2024.10444836
Tabbakh AAnnavaram M(2024)An efficient sequential consistency implementation with dynamic race detection for GPUsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2023.104836187:COnline publication date: 1-May-2024
https://dl.acm.org/doi/10.1016/j.jpdc.2023.104836
Levine RCho MMcKee DQuinn ASorensen TJust RFraser G(2023)GPUHarbor: Testing GPU Memory Consistency at Large (Experience Paper)Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598095(779-791)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598095
Show More Cited By

Index Terms

GPU Concurrency: Weak Behaviours and Programming Assumptions
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Recommendations

GPU Concurrency: Weak Behaviours and Programming Assumptions
ASPLOS'15

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software.

To remedy this state ...
GPU Concurrency: Weak Behaviours and Programming Assumptions
ASPLOS '15

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software.

To remedy this state ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...

Reviews

Reviewer: Karthik S Murthy

A memory consistency model (MCM) is a specification that describes the value(s) that a memory location should hold based on the causal history of operations that may or may not be associated with that location. The MCM specification is important because the correctness of a program depends on knowing what a thread reads/writes to a memory location in the presence of concurrent accesses by other threads. This specification exists at multiple levels: between a programmer and a compiler for a language, between a compiler and the processor/hardware on which it runs, between multiple hardware components, and so on. Considering this sandwich of specifications and the number of optimizations that many compilers/hardware do today, writing deterministic and performant parallel programs is a challenge for an expert or beginner programmer. Efforts to thoroughly understand and find bugs in MCMs have spanned multiple decades [1,2,3]. This paper plays a crucial role in understanding one such specification: the one between a programmer and the general-purpose graphics processing unit (GPGPU) hardware. In the paper, the authors describe that the observed consistency model on the GPU hardware is that of weak behavior, that is, a relaxed consistency model where operations can be reordered during execution leading to a nonsequentially consistent execution unless appropriate steps are taken such as using fences. They justify this stand by building a set of tests-litmus tests-that help capture this behavior. They then suggest fixes to obtain the required behavior when using such code snippets. They also produce a tool, optcheck, which can inspect GPU assembly code for reorderings that might change the behavior of the litmus tests. Finally, they present a formal model that can help build a simulation tool, proposed not implemented, which would predict possible behaviors of PTX fragments. I do have a few gripes about the paper. The authors do not indicate whether any out-of-thin-air reads were observed in the CoRR experiments. Furthermore, they justify the necessity of fences in Figure 6, but an argument about sufficiency is absent. Overall, the paper is very well written and easily understandable. Of particular mention is the way the authors defend the lack of clarity in vendor specifications via excerpts from such specifications that contradict each other. Finally, Table 2 depicting the behaviors identified by their study is a must-read for any GPU programmer. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

March 2015

720 pages

ISBN:9781450328357

DOI:10.1145/2694344

General Chairs:
Ozcan Ozturk
Bilkent University, Turkey
,
Kemal Ebcioglu
Global Supercomputing, USA
,
Program Chair:
Sandhya Dwarkadas
University of Rochester, USA

ACM SIGPLAN Notices Volume 50, Issue 4
ASPLOS '15
April 2015
676 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2775054
Editor:
Andy Gill
University of Kansas, Lawrence, KS
Issue’s Table of Contents
ACM SIGARCH Computer Architecture News Volume 43, Issue 1
ASPLOS'15
March 2015
676 pages
ISSN:0163-5964
DOI:10.1145/2786763
Editor:
Doug DeGroot
acm dot org
Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

NSF CCF
SRC
EPSRC
EU FP7 project CARP

Conference

ASPLOS '15

Sponsor:

ASPLOS '15: Architectural Support for Programming Languages and Operating Systems

March 14 - 18, 2015

Istanbul, Turkey

Acceptance Rates

ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

93
Total Citations
View Citations
1,191
Total Downloads

Downloads (Last 12 months)124
Downloads (Last 6 weeks)4

Reflects downloads up to 04 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Geeson LSmith L(2024)Compiler Testing with Relaxed Memory Models2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444836(334-348)Online publication date: 2-Mar-2024
https://doi.org/10.1109/CGO57630.2024.10444836
Tabbakh AAnnavaram M(2024)An efficient sequential consistency implementation with dynamic race detection for GPUsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2023.104836187:COnline publication date: 1-May-2024
https://dl.acm.org/doi/10.1016/j.jpdc.2023.104836
Levine RCho MMcKee DQuinn ASorensen TJust RFraser G(2023)GPUHarbor: Testing GPU Memory Consistency at Large (Experience Paper)Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598095(779-791)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598095
Levine RGuo TCho MBaker ALevien RNeto DQuinn ASorensen TAamodt TJerger NSwift M(2023)MC Mutants: Evaluating and Improving Testing for Memory Consistency SpecificationsProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575750(473-488)Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575750
Klimis VClark JBaker ANeto DWickerson JDonaldson A(2023)Taking Back Control in an Intermediate Representation for GPU ComputingProceedings of the ACM on Programming Languages10.1145/35712537:POPL(1740-1769)Online publication date: 11-Jan-2023
https://dl.acm.org/doi/10.1145/3571253
Graf MAndrade Gdos Santos L(2023)EveCheck: An Event-Driven, Scalable Algorithm for Coherent Shared Memory VerificationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2022.317805142:2(683-696)Online publication date: Feb-2023
https://doi.org/10.1109/TCAD.2022.3178051
Boeschen NBinnig CIves ZBonifati AEl Abbadi A(2022)GaccO - A GPU-accelerated OLTP DBMSProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517876(1003-1016)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517876
Dawson SBatty MSung AVerma RYap R(2022)Chronos vs. ChaosProceedings of the 2022 ACM on International Workshop on Security and Privacy Analytics10.1145/3510548.3519371(68-77)Online publication date: 18-Apr-2022
https://dl.acm.org/doi/10.1145/3510548.3519371
Lustig DCooksey SGiroux OSalapura VZahran MChong FTang L(2022)Mixed-proxy extensions for the NVIDIA PTX memory consistency modelProceedings of the 49th Annual International Symposium on Computer Architecture10.1145/3470496.3533045(1058-1070)Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3533045
Sorensen TSalvador LRaval HEvrard HWickerson JMartonosi MDonaldson A(2021)Specifying and testing GPU workgroup progress modelsProceedings of the ACM on Programming Languages10.1145/34855085:OOPSLA(1-30)Online publication date: 15-Oct-2021
https://dl.acm.org/doi/10.1145/3485508
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Cited By

Index Terms

Recommendations