Modeling GPU-CPU workloads and systems

Authors: Gregory Diamos, Sudhakar Yalamanchili

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units

Pages 31 - 42

https://doi.org/10.1145/1735688.1735696

Published: 14 March 2010

Abstract

Heterogeneous systems, which combine multiple processors tailored for specialized tasks, are challenging programming environments. While domain experts may be able to optimize a high-performance application for a very specific and well-documented system, the same application may not perform as well, or even function, on a different system. Developers with less experience in either the application domain or the system architecture may devote significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms.

This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure, which can execute and instrument the same CUDA applications on either target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling framework that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships between program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.
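To make the idea of predicting performance from pre-execution metrics concrete, the following is a minimal sketch, not the paper's method: the abstract does not specify the model used, so ordinary least squares is substituted here as a stand-in. The two metric names, all numeric values, and the function names `solve`, `fit_linear`, and `predict` are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: ordinary least squares stands in for the
# unspecified model in the abstract. All data below is synthetic.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= factor * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_linear(features, targets):
    """Fit targets ~ intercept + weights . features via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coeffs, feature_row):
    """Evaluate the fitted linear model on one row of metrics."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], feature_row))

# Hypothetical pre-execution metrics per kernel (synthetic values):
# (static instruction count in thousands, threads per block in hundreds)
metrics = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
# Synthetic runtimes in milliseconds, constructed as 2 + 3*a + 0.5*b
runtimes_ms = [6.0, 8.5, 13.0, 15.5, 19.5]

model = fit_linear(metrics, runtimes_ms)
print("predicted runtime (ms):", predict(model, (6.0, 2.0)))
```

A real study would use measured metrics and runtimes and would validate the model on kernels held out from fitting; the synthetic data here is exactly linear, so the fit recovers the generating coefficients.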

Published In

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units

March 2010

124 pages

ISBN:9781605589350

DOI:10.1145/1735688

General Chairs: David Kaeli (Northeastern University, Boston, MA), Miriam Leeser (Northeastern University, Boston, MA)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

GPGPU-3

GPGPU-3: Third Workshop on General-Purpose Computation on Graphics Processing Units

March 14, 2010

Pittsburgh, Pennsylvania, USA

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%
