Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs

Published: 24 February 2021

Abstract

The need for on-device real-time Deep Learning inference is increasing as deep learning on edge devices such as smartphones and robots becomes popular. Although hardware acceleration on NPUs is attracting more attention, recent mobile GPUs are fast enough to offer the potential for real-time inference of many CNNs. In this paper, we first analyze the inference time of widely used CNNs on recent mobile GPUs and reveal that significant overhead exists in GPU kernel launches. We then identify various factors that cause this kernel launch overhead, from which we formulate a performance model that predicts the optimal period for kernel flushes, minimizing the overhead. Our experimental results show speedups of up to 64% and 31% in the inference of various CNNs with TensorFlow Lite and the ARM Compute Library on the Adreno 650 GPU and Mali G76 GPU, respectively.
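The flush-period tradeoff the abstract describes can be illustrated with a toy model: flushing the command queue after every kernel pays a fixed driver cost repeatedly, while flushing too rarely lets enqueued kernels sit unsubmitted. The cost model, constants, and function names below are hypothetical illustrations for intuition only, not the paper's actual performance model.

```python
# Toy model of the kernel-flush-period tradeoff. All costs below are
# illustrative assumptions, not measurements or the paper's formulation.

def total_overhead_us(period, n_kernels, flush_cost_us, deferral_cost_us):
    """Estimate launch overhead when the command queue is flushed every
    `period` kernel enqueues.

    - Each flush pays a fixed cost (driver submission work).
    - Kernels waiting in an unflushed queue pay a deferral cost that grows
      with how long they sit in the queue (modeled linearly here).
    """
    n_flushes = -(-n_kernels // period)        # ceiling division
    avg_wait_slots = (period - 1) / 2          # mean wait within one batch
    return (n_flushes * flush_cost_us
            + n_kernels * avg_wait_slots * deferral_cost_us)

def best_flush_period(n_kernels, flush_cost_us, deferral_cost_us,
                      max_period=64):
    """Pick the flush period in [1, max_period] that minimizes the
    modeled total overhead."""
    return min(range(1, max_period + 1),
               key=lambda p: total_overhead_us(p, n_kernels,
                                               flush_cost_us,
                                               deferral_cost_us))
```

With, say, 100 kernels, a 50 us flush cost, and 1 us of deferral cost per queued slot, the model's minimum falls at an intermediate period rather than at either extreme, which is the qualitative behavior the paper's model is built to predict.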


Published In

HotMobile '21: Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications
February 2021, 192 pages
ISBN: 9781450383233
DOI: 10.1145/3446382


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Deep Learning
  2. Kernel Launch Overhead
  3. Mobile GPU
  4. OpenCL

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Samsung Research Funding & Incubation Center for Future Technology

Conference

HotMobile '21

Acceptance Rates

Overall Acceptance Rate 96 of 345 submissions, 28%
