research-article
Open access

LD: Low-Overhead GPU Race Detection Without Access Monitoring

Published: 21 March 2017

    Abstract

    Data race detection has become an important problem in GPU programming. Previous designs of CPU race-checking tools are mainly task parallel and incur high overhead on GPUs due to access instrumentation, especially when monitoring the many thousands of threads that GPU programs routinely use.
    This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. More importantly, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5× for over half of the tests and 10× on average, which is orders of magnitude more efficient than the cuda-memcheck tool by Nvidia and the methods that use fine-grained access instrumentation.
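
The abstract names value-based checking but does not describe its mechanism. The following is a minimal host-side sketch of one way such a check could work, not the article's implementation. It assumes that each group of threads (for example, a warp) works on its own private copy of a shared array, and that writes are inferred afterwards by comparing each copy against a snapshot taken at the start of the region. The names `changed_mask` and `find_write_write_races` are hypothetical.

```python
def changed_mask(snapshot, private_copy):
    """Flag each location whose value differs from the snapshot.

    Every location is compared independently, so the same comparison
    could run as one uniform data-parallel pass with no branching on
    which thread performed the write.
    """
    return [new != old for old, new in zip(snapshot, private_copy)]


def find_write_write_races(snapshot, private_copies):
    """Return {location: [group ids]} for locations changed by 2+ groups.

    snapshot: values of the shared array at the start of the region.
    private_copies: one list per thread group (an assumption of this
    sketch), each holding that group's values at the end of the region.
    """
    writers = [[] for _ in snapshot]
    for group_id, copy in enumerate(private_copies):
        for loc, changed in enumerate(changed_mask(snapshot, copy)):
            if changed:
                writers[loc].append(group_id)
    # A location modified in more than one private copy is a conflict.
    return {loc: ids for loc, ids in enumerate(writers) if len(ids) > 1}


if __name__ == "__main__":
    snapshot = [0, 0, 0, 0]
    copies = [
        [1, 0, 0, 0],  # group 0 changed location 0
        [0, 2, 0, 0],  # group 1 changed location 1
        [0, 3, 0, 4],  # group 2 changed locations 1 and 3
    ]
    print(find_write_write_races(snapshot, copies))
```

In this sketch no individual memory access is recorded, which is the sense in which it is instrumentation-free. The cost is that a write storing the same value as the snapshot goes unnoticed, and read-write conflicts are not covered at all.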

    Supplementary Material

    TACO1401-09 (taco1401-09.pdf)
    Slide deck associated with this paper




    Information

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 1
    March 2017
    258 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/3058793
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 March 2017
    Accepted: 01 January 2017
    Revised: 01 December 2016
    Received: 01 May 2016
    Published in TACO Volume 14, Issue 1


    Author Tags

    1. GPU race detection
    2. instrumentation-free
    3. low overhead
    4. value-based checking

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 83
    • Downloads (Last 6 weeks): 19
    Reflects downloads up to 27 Jul 2024

    Citations

    Cited By

    • (2024) Structural testing for CUDA programming model. Concurrency and Computation: Practice and Experience 36:14. DOI: 10.1002/cpe.8105. Online publication date: 9-Apr-2024
    • (2023) Memory access protocols: certified data-race freedom for GPU kernels. Formal Methods in System Design. DOI: 10.1007/s10703-023-00415-0. Online publication date: 26-May-2023
    • (2022) Provable GPU Data-Races in Static Race Detection. Electronic Proceedings in Theoretical Computer Science 356 (36-45). DOI: 10.4204/EPTCS.356.4. Online publication date: 24-Mar-2022
    • (2022) Exploring GNN based program embedding technologies for binary related tasks. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (366-377). DOI: 10.1145/3524610.3527900. Online publication date: 16-May-2022
    • (2021) Checking Data-Race Freedom of GPU Kernels, Compositionally. Computer Aided Verification (403-426). DOI: 10.1007/978-3-030-81685-8_19. Online publication date: 20-Jul-2021
    • (2020) LLOV. ACM Transactions on Architecture and Code Optimization 17:4 (1-26). DOI: 10.1145/3418597. Online publication date: 22-Dec-2020
    • (2020) Formal Methods for GPGPU Programming: Is the Demand Met? Integrated Formal Methods (160-177). DOI: 10.1007/978-3-030-63461-2_9. Online publication date: 16-Nov-2020
    • (2020) Partial migration technique for GPGPU tasks to Prevent GPU Memory Starvation in RPC‐based GPU Virtualization. Software: Practice and Experience 50:6 (948-972). DOI: 10.1002/spe.2801. Online publication date: 11-Feb-2020
    • (2019) Deep Ordinal Hashing With Spatial Attention. IEEE Transactions on Image Processing 28:5 (2173-2186). DOI: 10.1109/TIP.2018.2883522. Online publication date: 1-May-2019
    • (2019) Automating CUDA synchronization via program transformation. Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (748-759). DOI: 10.1109/ASE.2019.00075. Online publication date: 10-Nov-2019
