Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs

Published: 24 February 2021

Abstract

The need for on-device real-time Deep Learning inference is increasing as deep learning on edge devices such as smartphones and robots becomes popular. Although hardware acceleration on NPUs is attracting more attention, recent mobile GPUs are fast enough to offer the potential for real-time inference of many CNNs. In this paper, we first analyze the inference time of widely used CNNs on recent mobile GPUs and reveal that significant overhead exists in GPU kernel launches. We then identify various factors that cause this kernel launch overhead, from which we formulate a performance model that predicts the optimal period for kernel flushes, minimizing the overhead. Our experimental results show speedups of up to 64% and 31% in the inference of various CNNs with TensorFlow Lite and the ARM Compute Library on the Adreno 650 GPU and Mali G76 GPU, respectively.
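The flush-period tradeoff the abstract describes can be illustrated with a toy model: flushing the command queue after every kernel pays a fixed driver cost repeatedly, while flushing too rarely lets enqueued kernels sit unsubmitted. The cost model, constants, and function names below are hypothetical illustrations for intuition only, not the paper's actual performance model.

```python
# Toy model of the kernel-flush-period tradeoff. All costs below are
# illustrative assumptions, not measurements or the paper's formulation.

def total_overhead_us(period, n_kernels, flush_cost_us, deferral_cost_us):
    """Estimate launch overhead when the command queue is flushed every
    `period` kernel enqueues.

    - Each flush pays a fixed cost (driver submission work).
    - Kernels waiting in an unflushed queue pay a deferral cost that grows
      with how long they sit in the queue (modeled linearly here).
    """
    n_flushes = -(-n_kernels // period)        # ceiling division
    avg_wait_slots = (period - 1) / 2          # mean wait within one batch
    return (n_flushes * flush_cost_us
            + n_kernels * avg_wait_slots * deferral_cost_us)

def best_flush_period(n_kernels, flush_cost_us, deferral_cost_us,
                      max_period=64):
    """Pick the flush period in [1, max_period] that minimizes the
    modeled total overhead."""
    return min(range(1, max_period + 1),
               key=lambda p: total_overhead_us(p, n_kernels,
                                               flush_cost_us,
                                               deferral_cost_us))
```

With, say, 100 kernels, a 50 us flush cost, and 1 us of deferral cost per queued slot, the model's minimum falls at an intermediate period rather than at either extreme, which is the qualitative behavior the paper's model is built to predict.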


Published In

HotMobile '21: Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications
February 2021, 192 pages
ISBN: 9781450383233
DOI: 10.1145/3446382


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Deep Learning
  2. Kernel Launch Overhead
  3. Mobile GPU
  4. OpenCL

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Samsung Research Funding & Incubation Center for Future Technology

Conference

HotMobile '21

Acceptance Rates

Overall Acceptance Rate 96 of 345 submissions, 28%
