research-article

An experimental approach to performance measurement of heterogeneous parallel applications using CUDA

Authors:

Allen D. Malony,

Scott Biersdorff,

Shangkar Mayanglambam

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

Pages 127 - 136

https://doi.org/10.1145/1810085.1810105

Published: 02 June 2010

Abstract

Heterogeneous parallel systems using GPU devices for application acceleration have garnered significant attention in the supercomputing community. However, to realize the full potential of GPU computing, application developers will require tools to measure and analyze accelerator performance with respect to the parallel execution as a whole. A performance measurement technology for the NVIDIA CUDA platform has been developed and integrated with the TAU parallel performance system. The design of the TAUcuda package is based on an experimental NVIDIA CUDA driver and associated runtime and device libraries. In any environment where the CUDA experimental driver is installed, TAUcuda can provide detailed performance information regarding the execution of GPU kernels and the interactions with the parallel program without any modification to the program source or executable code. The paper describes the TAUcuda technology and how it is integrated with the TAU measurement framework to provide integrated performance views. Various examples of TAUcuda use are presented, including CUDA SDK examples, a GPU version of the Linpack benchmark, and a scalable molecular dynamics application, NAMD.

Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

June 2010

365 pages

ISBN: 9781450300186

DOI: 10.1145/1810085

General Chair: Taisuke Boku (University of Tsukuba)

Program Chairs: Hiroshi Nakashima (Kyoto University), Avi Mendelson (Microsoft)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

U.S. Department of Energy

Conference

ICS'10

ICS'10: International Conference on Supercomputing

June 2 - 4, 2010

Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

Sen S, Vanecek S, Schulz M (2023). GPUscout: Locating Data Movement-related Bottlenecks on GPUs. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 1392-1402. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3624062.3624208

Kucher V, Hunloh J, Gorlatch S (2020). Performance Portability and Unified Profiling for Finite Element Methods on Parallel Systems. Advances in Science, Technology and Engineering Systems Journal, 5:1, pages 119-127. Online publication date: Jan-2020.
https://doi.org/10.25046/aj050116

Welton B, Miller B, Ayguadé E, Hwu W, Badia R, Hofstee H (2020). Identifying and (automatically) remedying performance problems in CPU/GPU applications. Proceedings of the 34th ACM International Conference on Supercomputing, pages 1-13. Online publication date: 29-Jun-2020.
https://dl.acm.org/doi/10.1145/3392717.3392759

Welton B, Miller B, Taufer M, Balaji P, Peña A (2019). Diogenes. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-20. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356213

Kucher V, Fey F, Gorlatch S (2018). Unified Cross-Platform Profiling of Parallel C++ Applications. 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pages 57-62. Online publication date: Nov-2018.
https://doi.org/10.1109/PMBS.2018.8641652

Welton B, Miller B, El-Araby E, El-Ghazawi T, Panda D (2018). Exposing hidden performance opportunities in high performance GPU applications. Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 301-310. Online publication date: 1-May-2018.
https://dl.acm.org/doi/10.1109/CCGRID.2018.00045

Utrera G, Fornes J, Labarta J (2017). Noise Inspector Tool. 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pages 543-546. Online publication date: 2017.
https://doi.org/10.1109/PDP.2017.52

Ma H, Wang L, Tak B, Wang L, Tang C (2016). Auto-tuning Performance of MPI Parallel Programs Using Resource Management in Container-Based Virtual Cloud. 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), pages 545-552. Online publication date: Jun-2016.
https://doi.org/10.1109/CLOUD.2016.0078

Margiolas C, O'Boyle M (2014). Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 55-65. Online publication date: 15-Feb-2014.
https://dl.acm.org/doi/10.1145/2581122.2544156

Margiolas C, O'Boyle M (2014). Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 55-65. Online publication date: 15-Feb-2014.
https://dl.acm.org/doi/10.1145/2544137.2544156
