GPUDet: a deterministic GPU architecture

Published: 16 March 2013

Abstract

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. The lack of reproducibility is aggravated on massively parallel architectures like graphics processing units (GPUs), which run thousands of concurrent threads. We believe that providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.
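As an illustration of how the same input can yield different outputs, the sketch below accumulates the same three floating-point values in every possible arrival order, the way unordered atomic additions from concurrent threads might. The function name accumulate and the sample values are assumptions chosen for this example, not taken from the paper.

```python
import itertools

def accumulate(values, order):
    """Add values into one accumulator in the given arrival order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total

# The same input every time; only the order of the additions changes.
values = [0.1, 0.2, 0.3]

# Collect the distinct sums produced by all arrival orders.
results = {
    accumulate(values, order)
    for order in itertools.permutations(range(len(values)))
}

# Floating-point addition is not associative, so more than one sum appears.
print(sorted(results))
```

Because the hardware does not fix the order in which threads reach a shared location, any of these sums could be the program's output on a given run.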
Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
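The abstract does not spell out how the Z-Buffer Unit resolves conflicting writes, so the following is only a minimal sketch of the general idea, assuming a depth-test-like rule in which the write from the lowest thread ID wins. The function commit_lowest_id_wins, the tuple layout, and the sample writes are all hypothetical.

```python
import random

def commit_lowest_id_wins(writes):
    """Resolve conflicting writes with a depth-test-like comparison.

    Each write is (thread_id, address, value). For every address, the
    write from the lowest thread_id is kept, whatever the arrival order.
    """
    memory = {}  # address -> (thread_id, value)
    for thread_id, address, value in writes:
        holder = memory.get(address)
        if holder is None or thread_id < holder[0]:
            memory[address] = (thread_id, value)
    return {address: value for address, (_, value) in memory.items()}

# Four hypothetical threads; threads 0 to 2 all store to address 0x10.
writes = [(0, 0x10, "a"), (1, 0x10, "b"), (2, 0x10, "c"), (3, 0x20, "d")]

# Deliver the same writes in several shuffled arrival orders.
outcomes = []
for seed in range(5):
    arrival = writes[:]
    random.Random(seed).shuffle(arrival)
    outcomes.append(commit_lowest_id_wins(arrival))

# Every arrival order commits the same final memory contents.
assert all(outcome == outcomes[0] for outcome in outcomes)
```

The point of the comparison rule is that the winner depends only on a fixed property of each write, not on when it arrives, so the writes can proceed in parallel and out of order.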
Our simulation results indicate that GPUDet incurs only a 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.

Published In

ACM SIGARCH Computer Architecture News, Volume 41, Issue 1
ASPLOS '13
March 2013
540 pages
ISSN: 0163-5964
DOI: 10.1145/2490301

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
March 2013
574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2013
Published in SIGARCH Volume 41, Issue 1

Author Tags

  1. deterministic parallelism
  2. gpu

Qualifiers

  • Research-article