GPUDet: a deterministic GPU architecture

Authors:

Wilson W.L. Fung,

Joseph Devietti,

Tor M. Aamodt

ACM SIGARCH Computer Architecture News, Volume 41, Issue 1

Pages 1 - 12

https://doi.org/10.1145/2490301.2451118

Published: 16 March 2013

Abstract

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. The lack of reproducibility is aggravated on massively parallel architectures such as graphics processing units (GPUs), which run thousands of concurrent threads. We believe that providing a deterministic environment to ease debugging and testing of GPU applications is essential to enabling a broader class of software to use GPUs.
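
The effect described above can be illustrated with a minimal sketch. It is not taken from the paper: the three thread IDs, the stored values, and the `run` helper are hypothetical, and thread scheduling is modeled simply as a permutation of the order in which writes reach one shared memory cell.

```python
from itertools import permutations


def run(schedule):
    """Apply each thread's write to one shared cell in the given order."""
    cell = 0
    for thread_id in schedule:
        # Hypothetical workload: every thread stores a different value.
        cell = thread_id * 10
    return cell


# Each permutation stands for one possible interleaving of threads 1, 2, 3.
outcomes = {run(schedule) for schedule in permutations([1, 2, 3])}
print(sorted(outcomes))  # prints [10, 20, 30]
```

The input never changes, yet three different final values are possible, because the last writer wins and the schedule decides which thread that is.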

Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
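
The idea of letting out-of-order writes still yield a deterministic result can be sketched as follows. This is only an illustration under stated assumptions, not GPUDet's actual hardware logic: the abstract does not say which priority rule the Z-Buffer Unit applies, so the sketch assumes that the write from the lowest thread ID wins, in the way a depth test keeps the nearest fragment. The `commit` function and the sample writes are hypothetical.

```python
from itertools import permutations


def commit(writes):
    """Commit (thread_id, address, value) writes arriving in any order.

    Assumed rule (not stated in the abstract): like a depth test, a write
    replaces the stored value only if its thread_id is lower than the
    thread_id already recorded for that address.
    """
    memory = {}  # address -> (winning thread_id, value)
    for thread_id, address, value in writes:
        current = memory.get(address)
        if current is None or thread_id < current[0]:
            memory[address] = (thread_id, value)
    return {address: value for address, (_, value) in memory.items()}


# Hypothetical conflicting writes: two threads per address.
writes = [(2, 0x10, "b"), (0, 0x10, "a"), (1, 0x20, "c"), (3, 0x20, "d")]

# Every arrival order produces the same final memory contents.
results = [commit(order) for order in permutations(writes)]
assert all(result == results[0] for result in results)
print(results[0])
```

Because the winner depends only on the thread IDs and never on arrival order, the writes can be processed in parallel and out of order without changing the final memory state.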

Our simulation results indicate that GPUDet incurs only a 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 41, Issue 1

ASPLOS '13

March 2013

540 pages

ISSN:0163-5964

DOI:10.1145/2490301

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
March 2013
574 pages
ISBN:9781450318709
DOI:10.1145/2451116
General Chair: Vivek Sarkar, Rice University, USA
Program Chair: Rastislav Bodik, University of California, Berkeley, USA

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Cited By

Thoman P, Knorr F, Crisci L (2024). SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance. Proceedings of the 12th International Workshop on OpenCL and SYCL, pp. 1-12. Online publication date: 8-Apr-2024.
https://dl.acm.org/doi/10.1145/3648115.3648136
Zhang Y, Wang M, Wang W, Mai Y, Huang H, Yu Z (2024). Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 671-685. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00056
Chen B, Wen M, Shi Y, Lin D, Rajbahadur G, Jiang Z, Dwyer M, Damian D, Zeller A (2022). Towards training reproducible deep learning models. Proceedings of the 44th International Conference on Software Engineering, pp. 2202-2214. Online publication date: 21-May-2022.
https://dl.acm.org/doi/10.1145/3510003.3510163
Abts D, Kimmell G, Ling A, Kim J, Boyd M, Bitar A, Parmar S, Ahmed I, DiCecco R, Han D, Thompson J, Bye M, Hwang J, Fowers J, Lillian P, Murthy A, Mehtabuddin E, Tekur C, Sohmers T, Kang K, Maresh S, Ross J, Salapura V, Zahran M, Chong F, Tang L (2022). A software-defined tensor streaming multiprocessor for large-scale machine learning. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 567-580. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527405
Liu H, Nicolae B, Di S, Cappello F, Jog A (2021). Accelerating DNN Architecture Search at Scale Using Selective Weight Transfer. 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 82-93. Online publication date: Sep-2021.
https://doi.org/10.1109/Cluster48925.2021.00051
Chou Y, Ng C, Cattell S, Intan J, Sinclair M, Devietti J, Rogers T, Aamodt T (2020). Deterministic Atomic Buffering. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 981-995. Online publication date: Oct-2020.
https://doi.org/10.1109/MICRO50266.2020.00083
Merrifield T, Roghanchi S, Devietti J, Eriksson J, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). Lazy Determinism for Faster Deterministic Multithreading. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 879-891. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304047
Yazdanbakhsh A, Song C, Sacks J, Lotfi-Kamran P, Esmaeilzadeh H, Kim N, Evripidou S, Stenström P, O'Boyle M (2018). In-DRAM near-data approximate acceleration for GPUs. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-14. Online publication date: 1-Nov-2018.
https://dl.acm.org/doi/10.1145/3243176.3243188
Li P, Hu X, Chen D, Brock J, Luo H, Zhang E, Ding C (2017). LD. ACM Transactions on Architecture and Code Optimization, 14:1, pp. 1-25. Online publication date: 21-Mar-2017.
https://dl.acm.org/doi/10.1145/3046678
Merrifield T, Devietti J, Eriksson J, Réveillère L, Harris T, Herlihy M (2015). High-performance determinism with total store order consistency. Proceedings of the Tenth European Conference on Computer Systems, pp. 1-13. Online publication date: 17-Apr-2015.
https://dl.acm.org/doi/10.1145/2741948.2741960