research-article

Open access

LD: Low-Overhead GPU Race Detection Without Access Monitoring

Authors:

Chen Ding

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 1

Article No.: 9, Pages 1 - 25

https://doi.org/10.1145/3046678

Published: 21 March 2017

Abstract

Data race detection has become an important problem in GPU programming. Previous race-checking tools were designed for CPUs and are mainly task parallel. On GPUs they incur high overhead because of access instrumentation, especially when monitoring the many thousands of threads that GPU programs routinely use.

This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. More importantly, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5× for over half of the tests and 10× on average, which is orders of magnitude more efficient than the cuda-memcheck tool by Nvidia and the methods that use fine-grained access instrumentation.
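The abstract does not spell out how value-based checking works, so the following is only a toy sketch of the general idea, not the article's algorithm. It assumes each worker runs on a private copy of the shared data, and that conflicts are found afterward by comparing every copy against a snapshot taken before the region. The names `changed_indices` and `find_write_write_conflicts` are hypothetical, and plain Python lists stand in for GPU memory.

```python
from typing import Dict, List, Sequence


def changed_indices(snapshot: Sequence[int], private_copy: Sequence[int]) -> List[int]:
    """Return the indices where a private copy differs from the snapshot."""
    return [i for i, (old, new) in enumerate(zip(snapshot, private_copy)) if old != new]


def find_write_write_conflicts(
    snapshot: Sequence[int], private_copies: Sequence[Sequence[int]]
) -> Dict[int, List[int]]:
    """Map each index changed by two or more workers to the list of those workers."""
    writers: Dict[int, List[int]] = {}
    for worker, copy in enumerate(private_copies):
        for i in changed_indices(snapshot, copy):
            writers.setdefault(i, []).append(worker)
    # Only indices touched by more than one worker are potential write-write races.
    return {i: ws for i, ws in writers.items() if len(ws) > 1}


if __name__ == "__main__":
    snapshot = [0, 0, 0, 0]
    copies = [
        [1, 0, 0, 0],  # worker 0 changes index 0
        [0, 2, 0, 0],  # worker 1 changes index 1
        [3, 0, 0, 4],  # worker 2 changes indices 0 and 3
    ]
    print(find_write_write_conflicts(snapshot, copies))  # prints {0: [0, 2]}
```

No individual load or store is instrumented in this sketch: the comparison is one uniform pass over the data after the region ends, which is the kind of work that maps naturally onto data-parallel hardware. One known limitation of comparing values is that a write storing the same value as the snapshot is invisible to the check.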

Supplementary Material

TACO1401-09 (taco1401-09.pdf)

Slide deck associated with this paper

Index Terms

LD: Low-Overhead GPU Race Detection Without Access Monitoring
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 1

March 2017

258 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3058793

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 March 2017

Accepted: 01 January 2017

Revised: 01 December 2016

Received: 01 May 2016

Published in TACO Volume 14, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

IBM CAS Faculty Fellowship
Chinese Scholarship Council
National Science Foundation

Cited By

Luz H, Souza P, Souza S (2024). Structural testing for CUDA programming model. Concurrency and Computation: Practice and Experience 36:14. Online publication date: 9-Apr-2024.
https://doi.org/10.1002/cpe.8105
Cogumbreiro T, Lange J, Liew D, Zicarelli H (2023). Memory access protocols: certified data-race freedom for GPU kernels. Formal Methods in System Design. Online publication date: 26-May-2023.
https://doi.org/10.1007/s10703-023-00415-0
Liew D, Cogumbreiro T, Lange J (2022). Provable GPU Data-Races in Static Race Detection. Electronic Proceedings in Theoretical Computer Science 356 (36-45). Online publication date: 24-Mar-2022.
https://doi.org/10.4204/EPTCS.356.4
Guo Y, Li P, Luo Y, Wang X, Wang Z, Rastogi A, Tufano R, Bavota G, Arnaoudova V, Haiduc S (2022). Exploring GNN based program embedding technologies for binary related tasks. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (366-377). Online publication date: 16-May-2022.
https://dl.acm.org/doi/10.1145/3524610.3527900
Cogumbreiro T, Lange J, Rong D, Zicarelli H (2021). Checking Data-Race Freedom of GPU Kernels, Compositionally. Computer Aided Verification (403-426). Online publication date: 20-Jul-2021.
https://dl.acm.org/doi/10.1007/978-3-030-81685-8_19
Bora U, Das S, Kukreja P, Joshi S, Upadrasta R, Rajopadhye S (2020). LLOV. ACM Transactions on Architecture and Code Optimization 17:4 (1-26). Online publication date: 22-Dec-2020.
https://dl.acm.org/doi/10.1145/3418597
van den Haak L, Wijs A, van den Brand M, Huisman M (2020). Formal Methods for GPGPU Programming: Is the Demand Met? Integrated Formal Methods (160-177). Online publication date: 16-Nov-2020.
https://dl.acm.org/doi/10.1007/978-3-030-63461-2_9
Kang J, Lim J, Yu H (2020). Partial migration technique for GPGPU tasks to Prevent GPU Memory Starvation in RPC‐based GPU Virtualization. Software: Practice and Experience 50:6 (948-972). Online publication date: 11-Feb-2020.
https://doi.org/10.1002/spe.2801
Jin L, Shu X, Li K, Li Z, Qi G, Tang J (2019). Deep Ordinal Hashing With Spatial Attention. IEEE Transactions on Image Processing 28:5 (2173-2186). Online publication date: 1-May-2019.
https://dl.acm.org/doi/10.1109/TIP.2018.2883522
Wu M, Zhang L, Liu C, Tan S, Zhang Y, Zimmermann T, Lawall J, Marinov D (2019). Automating CUDA synchronization via program transformation. Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (748-759). Online publication date: 10-Nov-2019.
https://dl.acm.org/doi/10.1109/ASE.2019.00075
