CGPredict: Embedded GPU Performance Estimation from Single-Threaded Applications

Authors:

Tulika Mitra

ACM Transactions on Embedded Computing Systems (TECS), Volume 16, Issue 5s

Article No.: 146, Pages 1 - 22

https://doi.org/10.1145/3126546

Published: 27 September 2017

Abstract

Heterogeneous multiprocessor system-on-chip architectures are endowed with accelerators, such as embedded GPUs and FPGAs, that are capable of general-purpose computation. Application developers for such platforms need to carefully choose the accelerator with the maximum performance benefit. For a given application, the reference code is usually specified in a high-level, single-threaded programming language such as C. The performance of an application kernel on an accelerator is a complex interplay among the exposed parallelism, the compiler, and the accelerator architecture. Thus, determining the performance of a kernel requires redeveloping it in each accelerator-specific language, which wastes substantial time and effort. To aid the developer in this early design decision, we present an analytical framework, CGPredict, that predicts the performance of a computational kernel on an embedded GPU architecture from unoptimized, single-threaded C code. The analytical approach provides insights into application characteristics, which suggest further application-specific optimizations. The estimation error is as low as 2.66% (average 9%) compared to the performance of the same kernel written in native CUDA code running on an NVIDIA Kepler embedded GPU. This low estimation error enables CGPredict to provide an early design recommendation of the accelerator starting from C code.

Index Terms

CGPredict: Embedded GPU Performance Estimation from Single-Threaded Applications
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
    2. Parallel architectures
  2. Embedded and cyber-physical systems
    1. Embedded systems
2. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis

Published In

ACM Transactions on Embedded Computing Systems Volume 16, Issue 5s

Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017

October 2017

1448 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3145508

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 27 September 2017

Accepted: 01 July 2017

Revised: 01 June 2017

Received: 01 April 2017

Published in TECS Volume 16, Issue 5s

Qualifiers

Research-article
Research
Refereed
