DOI: 10.1145/1735688.1735696
research-article

Modeling GPU-CPU workloads and systems

Published: 14 March 2010

Abstract

Heterogeneous systems, which combine multiple processors tailored for specialized tasks, are challenging programming environments. Domain experts may be able to optimize a high-performance application for a very specific and well-documented system, but the same application may perform worse, or fail to function at all, on a different system. Developers who have less experience with either the application domain or the system architecture may devote significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms.
This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure, which can execute and instrument the same CUDA applications on either type of target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling framework that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships between program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.
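The abstract does not say which statistical model is used, so the following is only a minimal sketch of the general idea of predicting kernel runtime from metrics known before launch. It assumes ordinary least-squares regression, which may differ from the paper's actual method. The metric names, the training values, and the helper functions (`solve_linear_system`, `fit_least_squares`, `predict`) are hypothetical, and the runtimes are synthetic numbers generated from a known formula rather than measurements.

```python
"""Toy sketch: predict kernel runtime from metrics known before launch.

Everything here is illustrative. The metric names, the numbers, and the
choice of ordinary least squares are assumptions, not the paper's method.
"""


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations touch both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("metrics are linearly dependent; cannot fit")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(features, targets):
    """Return weights w (intercept first) minimizing ||X w - y||^2."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_row):
    """Apply the fitted linear model to one kernel's metrics."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical pre-execution metrics per kernel:
# (thousands of threads, MB transferred, thousands of instructions per thread)
TRAIN_METRICS = [
    (1.0, 2.0, 1.0),
    (2.0, 1.0, 3.0),
    (4.0, 3.0, 2.0),
    (3.0, 5.0, 1.0),
    (5.0, 2.0, 4.0),
]
# Synthetic runtimes in ms, generated as 0.5 + 0.2*t + 0.1*m + 0.3*i.
TRAIN_RUNTIME_MS = [0.5 + 0.2 * t + 0.1 * m + 0.3 * i for t, m, i in TRAIN_METRICS]

WEIGHTS = fit_least_squares(TRAIN_METRICS, TRAIN_RUNTIME_MS)

if __name__ == "__main__":
    # Predict the runtime of an unseen (hypothetical) kernel.
    print(f"predicted runtime: {predict(WEIGHTS, (2.0, 2.0, 2.0)):.3f} ms")
```

Because the synthetic runtimes are exactly linear in the metrics, the fit recovers the generating coefficients. Real measurements would be noisy and would likely call for more metrics, feature selection, and a separate model per processor.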

Published In

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
March 2010
124 pages
ISBN: 9781605589350
DOI: 10.1145/1735688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CUDA
  2. GPGPU
  3. Ocelot
  4. OpenCL
  5. PTX
  6. Rodinia
  7. parboil

Qualifiers

  • Research-article

Conference

GPGPU-3

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%

Cited By

  • (2023) Modeling and Characterizing Shared and Local Memories of the Ampere GPUs. Proceedings of the International Symposium on Memory Systems, pp. 1-3. DOI: 10.1145/3631882.3631891. Online publication date: 2-Oct-2023
  • (2023) Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 380-394. DOI: 10.1145/3613424.3614277. Online publication date: 28-Oct-2023
  • (2023) Taming Heterogeneity in Social Edge Computing. Social Edge Computing, pp. 49-70. DOI: 10.1007/978-3-031-26936-3_4. Online publication date: 20-Feb-2023
  • (2022) Performance prediction of deep learning applications training in GPU as a service systems. Cluster Computing 25:2, pp. 1279-1302. DOI: 10.1007/s10586-021-03428-8. Online publication date: 14-Jan-2022
  • (2022) Performance modeling for I/O‐intensive applications on virtual machines. Concurrency and Computation: Practice and Experience 34:10. DOI: 10.1002/cpe.6823. Online publication date: 18-Jan-2022
  • (2021) A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling. International Journal of High Performance Computing Applications 34:6, pp. 589-614. DOI: 10.1177/1094342020921340. Online publication date: 25-Feb-2021
  • (2021) Analytical Performance Estimation for Large-Scale Reconfigurable Dataflow Platforms. ACM Transactions on Reconfigurable Technology and Systems 14:3, pp. 1-21. DOI: 10.1145/3452742. Online publication date: 12-Aug-2021
  • (2021) A study of work distribution and contention in database primitives on heterogeneous CPU/GPU architectures. Proceedings of the 36th Annual ACM Symposium on Applied Computing, pp. 311-320. DOI: 10.1145/3412841.3441913. Online publication date: 22-Mar-2021
  • (2020) A Taxonomy and Survey of Power Models and Power Modeling for Cloud Servers. ACM Computing Surveys 53:5, pp. 1-41. DOI: 10.1145/3406208. Online publication date: 28-Sep-2020
  • (2020) A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs. ACM Transactions on Architecture and Code Optimization 17:1, pp. 1-26. DOI: 10.1145/3377138. Online publication date: 4-Mar-2020
