Memory performance estimation of CUDA programs

Authors:

Aviral Shrivastava

ACM Transactions on Embedded Computing Systems (TECS), Volume 13, Issue 2

Article No.: 21, Pages 1 - 22

https://doi.org/10.1145/2514641.2514648

Published: 30 September 2013

Abstract

CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider numerous architectural details, and small changes in source code, especially to the memory access pattern, can affect performance significantly. This makes it very difficult to optimize CUDA programs. This article presents CuMAPz, a tool to analyze and compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories, and optimize their programs for efficient memory behavior. CuMAPz models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflicts, channel skew, and branch divergence. Experimental results show that CuMAPz can accurately estimate performance, with a correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 30% more than the previous approach [Hong and Kim 2010].
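Two of the factors listed above, global memory access coalescing and shared memory bank conflicts, can be illustrated with a toy calculation. The sketch below is not CuMAPz's model. It assumes a 32-thread warp, 32 shared memory banks that are four bytes wide, and 128-byte global memory transactions; these parameters and both function names are hypothetical, chosen only for illustration. It counts how many memory segments and how many per-bank collisions a strided access pattern produces.

```python
# Illustrative toy model only. This is NOT the CuMAPz model.
# All hardware parameters below are assumptions made for this sketch.

WARP_SIZE = 32        # assumed threads per warp
NUM_BANKS = 32        # assumed number of shared memory banks
WORD_BYTES = 4        # assumed bank width (one 32-bit word)
SEGMENT_BYTES = 128   # assumed global memory transaction size


def global_transactions(stride_words, warp_size=WARP_SIZE):
    """Count the distinct memory segments touched by one warp.

    Thread t reads the 4-byte word at index t * stride_words.
    Fewer segments means better coalescing in this simplified view.
    """
    segments = {(t * stride_words * WORD_BYTES) // SEGMENT_BYTES
                for t in range(warp_size)}
    return len(segments)


def bank_conflict_degree(stride_words, warp_size=WARP_SIZE,
                         num_banks=NUM_BANKS):
    """Return the largest number of distinct words that map to one bank.

    A degree of 1 is conflict-free. A degree of n means the access is
    serialized into n passes in this simplified model. Threads reading
    the same word are treated as a broadcast, not a conflict.
    """
    words_per_bank = {}
    for t in range(warp_size):
        word = t * stride_words
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride,
              global_transactions(stride),
              bank_conflict_degree(stride))
```

Under these assumptions, a stride of 32 words puts every thread in the same bank, while a stride of 33 words spreads the threads across all banks again. This is why padding a shared memory tile by one column is a common fix in matrix transpose kernels.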

References

[1] Arm. 2013. ARM mali GPU. http://www.arm.com/products/multimedia/mali-graphics-hardware.

[2] Bakhoda, A., Yuan, G., Fung, W., Wong, H., and Aamodt, T. 2009. Analyzing cuda workloads using a detailed gpu simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09). 163--174.

[3] Baskaran, M., Ramanujam, J., and Sadayappan, P. 2010. Automatic c-to-cuda code generation for affine programs. In Compiler Construction, R. Gupta, Ed., Lecture Notes in Computer Science, vol. 6011, Springer, 244--263.

[4] Baskaran, M. M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., and Sadayappan, P. 2008. A compiler framework for optimization of affine loop nests for gpgpus. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS'08). ACM Press, New York, 225--234.

[5] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'09). 44--54.

[6] Ge Intelligent Platforms. 2013. IPN250 single board computer. http://www.geip.com/products/3514.

[7] Hong, S. and Kim, H. 2010. An integrated gpu power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10). ACM Press, New York, 280--289.

[8] Issenin, I., Brockmeyer, E., Miranda, M., and Dutt, N. 2004. Data reuse analysis technique for software-controlled memory hierarchies. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'04). 202--207.

[9] Kolson, D., Nicolau, A., and Dutt, N. 1996. Elimination of redundant memory traffic in high-level synthesis. IEEE Trans. Comput.-Aid. Des. 15, 11, 1354--1363.

[10] Leung, A., Vasilache, N., Meister, B., Baskaran, M., Wohlford, D., Bastoul, C., and Lethin, R. 2010. A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU'10). ACM Press, New York, 51--61.

[11] Nvidia. 2013a. Board specification, tesla c1060 computing processor board. http://www.nvidia.com/docs/IO/56483/Tesla C1060 boardSpec v03.pdf.

[12] Nvidia. 2013b. NVIDIA ion processors. http://www.nvidia.com/object/sff ion.html.

[13] Nvidia. 2010a. NVIDIA cuda best practices guide, version 3.1. http://www.classes.cs.uchicago.edu/archive/2011/winter/32102/reading/CUDA_C_Best_Practices_Guide.pdf.

[14] Nvidia. 2010b. NVIDIA cuda programming guide, version 3.1. Opencl. http://www.khronos.org/opencl/.

[15] Ruetsch, G. and Micikevicius, P. 2009. Optimizing matrix transpose in cuda. http://www.cs.colostate.edu/~cs675/MatrixTranspose.pdf.

[16] Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W.-M. W. 2008a. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'08). ACM Press, New York, 73--82.

[17] Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., Zee Ueng, S., Stratton, J. A., and Hwu, W.-M. W. 2008b. Program optimization space pruning for a multithreaded gpu. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'08).

[18] Techniscan. 2013. 3D breast cancer detection system using tesla. http://www.techniscanmedicalsystems.com.

[19] Volkov, V. 2010. Better performance at lower occupancy. GPU Technology Conference. http://gpucomputing.net/?q=node/5893.

[20] Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A gpgpu compiler for memory optimization and parallelism management. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'10). ACM Press, New York, 86--97.

[21] Zee Ueng, S., Lathara, M., Baghsorkhi, S. S., and Hwu, W.-M. W. 2008. Cuda-lite: Reducing gpu programming complexity. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, vol. 5335, Springer, 1--15.

[22] Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for gpu architectures. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA'11).

Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 13, Issue 2

Special issue on application-specific processors

September 2013

254 pages

ISSN: 1539-9087

EISSN: 1558-3465

DOI: 10.1145/2514641

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 30 September 2013

Accepted: 01 May 2012

Revised: 01 February 2012

Received: 01 February 2011

Published in TECS Volume 13, Issue 2

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 5

Total Downloads: 510

Downloads (last 12 months): 23

Downloads (last 6 weeks): 1

Reflects downloads up to 03 Feb 2025

Cited By

Vießmann, H. and Scholz, S. 2020. Effective Host-GPU Memory Management Through Code Generation. In Proceedings of the 32nd Symposium on Implementation and Application of Functional Languages. 138--149. Online publication date: 2-Sep-2020. https://dl.acm.org/doi/10.1145/3462172.3462199

Nadal-Serrano, J. and Lopez-Vallejo, M. 2016. A Performance Study of CUDA UVM versus Manual Optimizations in a Real-World Setup: Application to a Monte Carlo Wave-Particle Event-Based Interaction Model. IEEE Transactions on Parallel and Distributed Systems 27, 6, 1579--1588. Online publication date: 1-Jun-2016. https://dl.acm.org/doi/10.1109/TPDS.2015.2463813

Magoulès, F. and Ahamed, A. 2015. Alinea. International Journal of High Performance Computing Applications 29, 3, 284--310. Online publication date: 1-Aug-2015. https://dl.acm.org/doi/10.1177/1094342015576774

Jung, Y. and Carloni, L. 2015. ΣVP. In Proceedings of the 52nd Annual Design Automation Conference. 1--6. Online publication date: 7-Jun-2015. https://dl.acm.org/doi/10.1145/2744769.2744913

Mekbungwan, P., Devkota, B., Gurung, S., Tunpan, A., Kanchanasut, K., and Kitisin, S. 2013. Network coding based bulk data synchronization in mobile ad hoc networks. In Proceedings of the 9th Asian Internet Engineering Conference. 17--24. Online publication date: 13-Nov-2013. https://dl.acm.org/doi/10.1145/2534142.2534145
