DOI: 10.1145/1810085.1810105
research-article

An experimental approach to performance measurement of heterogeneous parallel applications using CUDA

Published: 02 June 2010

Abstract

Heterogeneous parallel systems using GPU devices for application acceleration have garnered significant attention in the supercomputing community. However, to realize the full potential of GPU computing, application developers will require tools to measure and analyze accelerator performance with respect to the parallel execution as a whole. A performance measurement technology for the NVIDIA CUDA platform has been developed and integrated with the TAU parallel performance system. The design of the TAUcuda package is based on an experimental NVIDIA CUDA driver and associated runtime and device libraries. In any environment where the CUDA experimental driver is installed, TAUcuda can provide detailed performance information regarding the execution of GPU kernels and the interactions with the parallel program without any modification to the program source or executable code. The paper describes the TAUcuda technology and how it is integrated with the TAU measurement framework to provide integrated performance views. Various examples of TAUcuda use are presented, including CUDA SDK examples, a GPU version of the Linpack benchmark, and a scalable molecular dynamics application, NAMD.


Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
June 2010
365 pages
ISBN:9781450300186
DOI:10.1145/1810085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPGPU
  2. performance tools
  3. profiling
  4. tracing

Qualifiers

  • Research-article

Conference

ICS'10
ICS'10: International Conference on Supercomputing
June 2 - 4, 2010
Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
