FusionCL: a machine-learning based approach for OpenCL kernel fusion to increase system performance

Published: 01 October 2021

Abstract

Employing general-purpose graphics processing units (GPGPUs) through OpenCL has greatly reduced the execution time of data-parallel applications by exploiting the massive parallelism available. However, when an application with a small data size executes on a GPU, GPU resources are wasted because the application cannot fully utilize the GPU compute cores, and the lack of operating-system support on GPUs means there is no mechanism to share a GPU between two kernels. In this paper, we propose a GPU sharing mechanism between two kernels that increases GPU occupancy and, as a result, reduces the execution time of a job pool. However, if a pair of kernels competes for the same set of resources (i.e., both applications are compute-intensive or memory-intensive), kernel fusion may significantly increase the execution time of the fused kernels. It is therefore pertinent to select an optimal pair of kernels whose fusion yields a significant speedup over serial execution. This research presents FusionCL, a machine learning-based GPU sharing mechanism for pairs of OpenCL kernels. FusionCL identifies the kernel pairs (from the job pool) that are suitable candidates for fusion using a machine learning-based fusion suitability classifier. Thereafter, from all candidates, it uses a fusion speedup predictor to select the pair expected to produce the maximum speedup after fusion over serial execution. The experimental evaluation shows that the proposed kernel fusion mechanism reduces execution time by 2.83× compared to a baseline scheduling scheme, and by up to 8% compared to the state of the art.
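To make the two-stage selection concrete, the following is a minimal Python sketch of the idea, not the paper's implementation: scikit-learn gradient-boosting models stand in for whatever learners FusionCL actually uses, the pair_features() helper and the per-kernel feature vectors are hypothetical, and the models are fitted on random placeholder data purely so the example runs end to end.

from itertools import combinations

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Stage 1: fusion suitability classifier; Stage 2: fusion speedup predictor.
# In FusionCL both models would be trained offline on profiled kernel pairs;
# here they are fitted on random placeholder data only so the sketch executes.
rng = np.random.default_rng(0)
X_train = rng.random((64, 8))
suitability_clf = GradientBoostingClassifier().fit(X_train, rng.integers(0, 2, 64))
speedup_reg = GradientBoostingRegressor().fit(X_train, rng.random(64) * 2)

def pair_features(kernel_a, kernel_b):
    # Hypothetical static features of a kernel pair (e.g. compute/memory
    # instruction mix, work-group size, transferred data size).
    return np.concatenate([kernel_a["features"], kernel_b["features"]])

def select_fusion_pair(job_pool):
    """Return the suitable kernel pair with the highest predicted fusion speedup."""
    best_pair, best_speedup = None, 1.0  # speedup <= 1 means serial execution wins
    for kernel_a, kernel_b in combinations(job_pool, 2):
        x = pair_features(kernel_a, kernel_b).reshape(1, -1)
        if suitability_clf.predict(x)[0] != 1:  # pair competes for the same resources
            continue
        predicted = speedup_reg.predict(x)[0]
        if predicted > best_speedup:
            best_pair, best_speedup = (kernel_a, kernel_b), predicted
    return best_pair, best_speedup

# Example: a job pool of four kernels, each with four hypothetical features.
pool = [{"name": f"k{i}", "features": rng.random(4)} for i in range(4)]
print(select_fusion_pair(pool))

The design choice mirrored here is that suitability filtering and speedup prediction are separate models: the classifier prunes resource-conflicting pairs cheaply, and the regressor only ranks the pairs that survive.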

Published In

Computing, Volume 103, Issue 10, October 2021, 288 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Received: 12 August 2020
Accepted: 08 May 2021
Published: 01 October 2021

Author Tags

1. Scheduling
2. Kernel fusion
3. High-performance computing
4. Machine learning

Mathematics Subject Classification

1. 68M20
