DOI: 10.1145/3477132.3483545
research-article

iGUARD: In-GPU Advanced Race Detection

Published: 26 October 2021

Abstract

Newer use cases of GPU (Graphics Processing Unit) computing, e.g., graph analytics, look less like traditional bulk-synchronous GPU programs. To cater to the needs of emerging applications with semantically richer and finer-grained sharing patterns, GPU vendors have been introducing advanced programming features, e.g., scoped synchronization and independent thread scheduling. While these features can speed up many applications and enable newer use cases, they can also introduce subtle synchronization errors if used incorrectly.
We present iGUARD, a runtime software tool to detect races in GPU programs due to incorrect use of such advanced features. To be practical, a race detector must detect races accurately at reasonable overhead. We thus perform the race detection on the GPU itself without relying on the CPU. The GPU's parallelism helps speed up race detection by 15x over a closely related prior work. Importantly, iGUARD detects newer types of races that no previously known tool could detect. It detected previously unknown subtle bugs in popular GPU programs, including three in NVIDIA-supported commercial libraries. In total, iGUARD detected 57 races in 21 GPU programs, without false positives.
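
To make the notion of a race caused by misused scoped synchronization more concrete, the following is a minimal Python sketch of when two memory accesses could be flagged as a scoped race. It is not iGUARD's algorithm. The `Access` record, the `is_scoped_race` helper, and the simplification that ignores barriers and fences are all assumptions made for illustration; only the scope names (block, device, system) mirror the scopes CUDA exposes for atomics.

```python
from dataclasses import dataclass
from typing import Optional

# Scopes ordered from narrowest to widest (names mirror CUDA's atomic scopes).
SCOPE_RANK = {"block": 0, "device": 1, "system": 2}


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one memory access (illustration only)."""
    block: int                          # thread block issuing the access
    thread: int                         # thread index within that block
    address: int                        # memory address touched
    is_write: bool                      # True for a write or read-modify-write
    atomic_scope: Optional[str] = None  # None means a plain, non-atomic access


def needed_scope(a: Access, b: Access) -> str:
    """Narrowest scope that covers both accessing threads."""
    return "block" if a.block == b.block else "device"


def is_scoped_race(a: Access, b: Access) -> bool:
    """Simplified check: barriers and fences are deliberately ignored."""
    same_thread = (a.block, a.thread) == (b.block, b.thread)
    if same_thread or a.address != b.address:
        return False
    if not (a.is_write or b.is_write):
        return False  # two reads never conflict
    if a.atomic_scope is None or b.atomic_scope is None:
        return True   # conflicting plain access with no synchronization modeled
    # Both accesses are atomic: each scope must be wide enough to cover both threads.
    narrowest = min(SCOPE_RANK[a.atomic_scope], SCOPE_RANK[b.atomic_scope])
    return narrowest < SCOPE_RANK[needed_scope(a, b)]


if __name__ == "__main__":
    # Two blocks update one counter with block-scoped atomics: flagged.
    print(is_scoped_race(Access(0, 0, 0x10, True, "block"),
                         Access(1, 0, 0x10, True, "block")))
    # The same updates with device scope: not flagged.
    print(is_scoped_race(Access(0, 0, 0x10, True, "device"),
                         Access(1, 0, 0x10, True, "device")))
```

In this simplified model, a block-scoped atomic is sufficient only when every conflicting thread sits in the same block; once another block touches the same address, the narrower scope no longer orders the accesses.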

Cited By

  • (2024) Indigo3: A Parallel Graph Analytics Benchmark Suite for Exploring Implementation Styles and Common Bugs. ACM Transactions on Parallel Computing. DOI: 10.1145/3665251. Online publication date: 15-May-2024
  • (2023) cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications. Proceedings of the ACM on Programming Languages 7:PLDI (124-147). DOI: 10.1145/3591225. Online publication date: 6-Jun-2023

Published In

SOSP '21: Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles
October 2021
899 pages
ISBN:9781450387095
DOI:10.1145/3477132

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data races
  2. Debugging
  3. GPU program correctness

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOSP '21

Acceptance Rates

Overall Acceptance Rate 131 of 716 submissions, 18%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 118
  • Downloads (Last 6 weeks): 15
Reflects downloads up to 16 Oct 2024
