CGPredict: Embedded GPU Performance Estimation from Single-Threaded Applications

Published: 27 September 2017

Abstract

Heterogeneous multiprocessor system-on-chip architectures include accelerators, such as embedded GPUs and FPGAs, that are capable of general-purpose computation. Application developers for such platforms must carefully choose the accelerator that offers the greatest performance benefit. For a given application, the reference code is usually written in a high-level, single-threaded programming language such as C. The performance of an application kernel on an accelerator depends on a complex interplay among the exposed parallelism, the compiler, and the accelerator architecture. Determining the performance of a kernel therefore requires redeveloping it in each accelerator-specific language, which costs substantial time and effort. To aid the developer in this early design decision, we present CGPredict, an analytical framework that predicts the performance of a computational kernel on an embedded GPU architecture from unoptimized, single-threaded C code. The analytical approach provides insights into application characteristics, which suggest further application-specific optimizations. The estimation error is as low as 2.66% (9% on average) relative to the performance of the same kernel written in native CUDA code and run on an NVIDIA Kepler embedded GPU. This low estimation error enables CGPredict to provide an early design recommendation for the accelerator, starting from C code.

Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 5s
Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
October 2017
1448 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3145508
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2017
Accepted: 01 July 2017
Revised: 01 June 2017
Received: 01 April 2017
Published in TECS Volume 16, Issue 5s

Author Tags

  1. GPGPU
  2. Heterogeneous platform
  3. analytical model
  4. cross-platform prediction
  5. mobile platform
  6. performance modeling

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) Bringing Energy Efficiency Closer to Application Developers: An Extensible Software Analysis Framework. IEEE Transactions on Sustainable Computing 8:2 (180-193). DOI: 10.1109/TSUSC.2022.3222409. Online publication date: 1-Apr-2023
  • (2022) NURA. Proceedings of the ACM on Measurement and Analysis of Computing Systems 6:1 (1-27). DOI: 10.1145/3508036. Online publication date: 28-Feb-2022
  • (2022) ML for System-Level Modeling. Machine Learning Applications in Electronic Design Automation (545-579). DOI: 10.1007/978-3-031-13074-8_18. Online publication date: 10-Aug-2022
  • (2021) Scheduling-Aware Prefetching: Enabling the PCIe SSD to Extend the Global Memory of GPU Device. 2021 IEEE 10th Non-Volatile Memory Systems and Applications Symposium (NVMSA) (1-6). DOI: 10.1109/NVMSA53655.2021.9628829. Online publication date: 18-Aug-2021
  • (2020) Efficient Performance Estimation and Work-Group Size Pruning for OpenCL Kernels on GPUs. IEEE Transactions on Parallel and Distributed Systems 31:5 (1089-1106). DOI: 10.1109/TPDS.2019.2958343. Online publication date: 1-May-2020
  • (2020) Mobile Application Processors: Techniques for Software Power-Performance Optimization. IEEE Consumer Electronics Magazine 9:4 (67-76). DOI: 10.1109/MCE.2020.2969171. Online publication date: 1-Jul-2020
  • (2020) Efficient Model Solving for Markov Decision Processes. 2020 IEEE Symposium on Computers and Communications (ISCC) (1-5). DOI: 10.1109/ISCC50000.2020.9219668. Online publication date: Jul-2020
  • (2019) Exploiting Process Similarity of 3D Flash Memory for High Performance SSDs. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (211-223). DOI: 10.1145/3352460.3358311. Online publication date: 12-Oct-2019
  • (2019) Fast Performance Estimation and Design Space Exploration of Manycore-based Neural Processors. Proceedings of the 56th Annual Design Automation Conference 2019 (1-6). DOI: 10.1145/3316781.3317823. Online publication date: 2-Jun-2019
  • (2019) OPTiC: Optimizing Collaborative CPU–GPU Computing on Mobile Devices With Thermal Constraints. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38:3 (393-406). DOI: 10.1109/TCAD.2018.2873210. Online publication date: Mar-2019
