research-article

Memory performance estimation of CUDA programs

Published: 30 September 2013

Abstract

CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider numerous architectural details, and small changes in source code, especially to the memory access pattern, can affect performance significantly. This makes it very difficult to optimize CUDA programs. This article presents CuMAPz, a tool to analyze and compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories, and optimize their programs for efficient memory behavior. CuMAPz models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflicts, channel skew, and branch divergence. Experimental results show that CuMAPz can accurately estimate performance, with a correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 30% more than the previous approach [Hong and Kim 2010].
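
To make two of the factors named above more concrete, the following is a minimal sketch of how shared memory bank conflicts and global memory coalescing could be counted for one warp performing a strided access. It is not CuMAPz's model. The function names, the 32-thread warp, the 32 banks, the 4-byte word, and the 128-byte segment are all assumptions chosen for illustration, and real hardware rules differ between GPU generations.

```python
# Illustrative sketch only: the bank count, segment size, and word size below
# are assumed values, not parameters taken from CuMAPz or the article.

def strided_warp(stride, warp_size=32):
    """Word index accessed by each thread of a warp for a given stride."""
    return [tid * stride for tid in range(warp_size)]

def bank_conflict_degree(word_indices, num_banks=32):
    """Largest number of distinct words that fall into one shared memory bank.

    A degree of 1 means conflict-free; a degree of n means the access is
    serialized into roughly n passes. Threads reading the same word are
    counted once, since such reads can be served by a broadcast.
    """
    words_per_bank = {}
    for word in set(word_indices):
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())

def segments_touched(word_indices, word_bytes=4, segment_bytes=128):
    """Number of aligned global memory segments a warp's access touches."""
    return len({(word * word_bytes) // segment_bytes for word in word_indices})

if __name__ == "__main__":
    # Columns: stride, bank conflict degree, global memory segments touched.
    for stride in (1, 2, 32, 33):
        warp = strided_warp(stride)
        print(stride, bank_conflict_degree(warp), segments_touched(warp))
```

Under these assumptions, a stride of 32 words places every thread in the same bank, while a stride of 33 spreads the threads across all banks. This is the kind of small source change with a large performance effect that the abstract describes.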

References

[1]
Arm. 2013. ARM mali GPU. http://www.arm.com/products/multimedia/mali-graphics-hardware.
[2]
Bakhoda, A., Yuan, G., Fung, W., Wong, H., and Aamodt, T. 2009. Analyzing cuda workloads using a detailed gpu simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09). 163--174.
[3]
Baskaran, M., Ramanujam, J., and Sadayappan, P. 2010. Automatic c-to-cuda code generation for affine programs. In Compiler Construction, R. Gupta, Ed., Lecture Notes in Computer Science, vol. 6011., Springer, 244--263.
[4]
Baskaran, M. M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., and Sadayappan, P. 2008. A compiler framework for optimization of affine loop nests for gpgpus. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS'08). ACM Press, New York, 225--234.
[5]
Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'09). 44--54.
[6]
Ge Intelligent Platforms. 2013. IPN250 single board computer. http://www.geip.com/products/3514.
[7]
Hong, S. and Kim, H. 2010. An integrated gpu power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10). ACM Press, New York, 280--289.
[8]
Issenin, I., Brockmeyer, E., Miranda, M., and Dutt, N. 2004. Data reuse analysis technique for software-controlled memory hierarchies. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'04). 202--207.
[9]
Kolson, D., Nicolau, A., and Dutt, N. 1996. Elimination of redundant memory traffic in high-level synthesis. IEEE Trans. Comput.-Aid. Des. 15, 11, 1354--1363.
[10]
Leung, A., Vasilache, N., Meister, B., Baskaran, M., Wohlford, D., Bastoul, C., and Lethin, R. 2010. A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU'10). ACM Press, New York, 51--61.
[11]
Nvidia. 2013a. Board specification, tesla c1060 computing processor board. http://www.nvidia.com/docs/IO/56483/Tesla_C1060_boardSpec_v03.pdf.
[12]
Nvidia. 2013b. NVIDIA ion processors. http://www.nvidia.com/object/sff_ion.html.
[13]
Nvidia. 2010a. NVIDIA cuda best practices guide, version 3.1. http://www.classes.cs.uchicago.edu/archive/2011/winter/32102/reading/CUDA_C_Best_Practices_Guide.pdf.
[14]
Nvidia. 2010b. NVIDIA cuda programming guide, version 3.1.
Opencl. http://www.khronos.org/opencl/.
[15]
Ruetsch, G. and Micikevicius, P. 2009. Optimizing matrix transpose in cuda. http://www.cs.colostate.edu/~cs675/MatrixTranspose.pdf.
[16]
Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W.-M. W. 2008a. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'08). ACM Press, New York, 73--82.
[17]
Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., Zee Ueng, S., Stratton, J. A., and Hwu, W.-M. W. 2008b. Program optimization space pruning for a multithreaded gpu. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'08).
[18]
Techniscan. 2013. 3D breast cancer detection system using tesla. http://www.techniscanmedicalsystems.com.
[19]
Volkov, V. 2010. Better performance at lower occupancy. GPU Technology Conference. http://gpucomputing.net/?q=node/5893.
[20]
Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A gpgpu compiler for memory optimization and parallelism management. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'10). ACM Press, New York, 86--97.
[21]
Zee Ueng, S., Lathara, M., Baghsorkhi, S. S., and Hwu, W.-M. W. 2008. Cuda-lite: Reducing gpu programming complexity. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, vol. 5335, Springer, 1--15.
[22]
Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for gpu architectures. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA'11).

Published In

ACM Transactions on Embedded Computing Systems  Volume 13, Issue 2
Special issue on application-specific processors
September 2013
254 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/2514641
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2013
Accepted: 01 May 2012
Revised: 01 February 2012
Received: 01 February 2011
Published in TECS Volume 13, Issue 2

Author Tags

  1. CUDA
  2. GPGPU
  3. memory performance
  4. performance estimation
  5. program optimization

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2020) Effective Host-GPU Memory Management Through Code Generation. Proceedings of the 32nd Symposium on Implementation and Application of Functional Languages, 138-149. DOI: 10.1145/3462172.3462199. Online publication date: 2-Sep-2020.
  • (2016) A Performance Study of CUDA UVM versus Manual Optimizations in a Real-World Setup: Application to a Monte Carlo Wave-Particle Event-Based Interaction Model. IEEE Transactions on Parallel and Distributed Systems 27:6, 1579-1588. DOI: 10.1109/TPDS.2015.2463813. Online publication date: 1-Jun-2016.
  • (2015) Alinea. International Journal of High Performance Computing Applications 29:3, 284-310. DOI: 10.1177/1094342015576774. Online publication date: 1-Aug-2015.
  • (2015) ΣVP. Proceedings of the 52nd Annual Design Automation Conference, 1-6. DOI: 10.1145/2744769.2744913. Online publication date: 7-Jun-2015.
  • (2013) Network coding based bulk data synchronization in mobile ad hoc networks. Proceedings of the 9th Asian Internet Engineering Conference, 17-24. DOI: 10.1145/2534142.2534145. Online publication date: 13-Nov-2013.
