DOI: 10.1145/2688500.2688521

On optimizing machine learning workloads via kernel fusion

Published: 24 January 2015
  Abstract

    Exploiting parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such exploitation. Two approaches are commonly pursued: (i) developing GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, which are then used to build ML algorithms. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that commonly occur in popular ML algorithms. We identify the generic computation pattern α · Xᵀ (v ⊙ (X y)) + β · z, where ⊙ denotes element-wise multiplication, and its various instantiations. We develop a fused kernel to optimize this computation on GPUs, with specialized techniques to handle both sparse and dense matrices. This approach not only reduces the cost of data loads through improved temporal locality but also enables further optimizations such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters. The proposed approach provides speedups ranging from 2× to 67× for different instances of the generic pattern, compared to launching multiple operator-level kernels from GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance of an entire ML algorithm.
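
    To make the pattern concrete, here is a minimal sketch of a fused CUDA kernel for the dense case. This is not the paper's implementation: the kernel names, launch configuration, and row-major dense layout are assumptions for illustration. The point it demonstrates is the locality benefit of fusion: each thread loads one row of X once and reuses it for both the forward product (X y) and the transposed accumulation Xᵀu, instead of reloading X across separate operator-level kernel launches.

    ```cuda
    // Sketch of a fused kernel for r = alpha * X^T (v .* (X y)) + beta * z,
    // where X is an m x n row-major dense matrix and ".*" is element-wise.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Pre-initialize the output: r = beta * z.
    __global__ void initOutput(const float *z, float beta, float *r, int n) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < n) r[j] = beta * z[j];
    }

    // Fused kernel: t_i = (X y)_i, u_i = v_i * t_i, then r += alpha * X^T u.
    __global__ void fusedPattern(const float *X, const float *y, const float *v,
                                 float alpha, float *r, int m, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= m) return;
        const float *row = X + (size_t)i * n;      // row i of X, loaded once
        float t = 0.0f;
        for (int j = 0; j < n; ++j)                // t_i = dot(X[i,:], y)
            t += row[j] * y[j];
        float u = v[i] * t;                        // u_i = v_i * t_i
        for (int j = 0; j < n; ++j)                // scatter alpha * X[i,:] * u_i
            atomicAdd(&r[j], alpha * row[j] * u);  // into r (r starts at beta*z)
    }

    int main() {
        const int m = 4, n = 3;
        float hX[m * n] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 1};
        float hy[n] = {1, 1, 1}, hv[m] = {1, 2, 0.5f, 1}, hz[n] = {1, 1, 1};
        const float alpha = 1.0f, beta = 2.0f;

        float *dX, *dy, *dv, *dz, *dr;
        cudaMalloc(&dX, sizeof hX);  cudaMalloc(&dy, sizeof hy);
        cudaMalloc(&dv, sizeof hv);  cudaMalloc(&dz, sizeof hz);
        cudaMalloc(&dr, sizeof hz);
        cudaMemcpy(dX, hX, sizeof hX, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, sizeof hy, cudaMemcpyHostToDevice);
        cudaMemcpy(dv, hv, sizeof hv, cudaMemcpyHostToDevice);
        cudaMemcpy(dz, hz, sizeof hz, cudaMemcpyHostToDevice);

        initOutput<<<1, 256>>>(dz, beta, dr, n);
        fusedPattern<<<1, 256>>>(dX, dy, dv, alpha, dr, m, n);

        float hr[n];
        cudaMemcpy(hr, dr, sizeof hr, cudaMemcpyDeviceToHost);
        for (int j = 0; j < n; ++j) printf("r[%d] = %g\n", j, hr[j]);
        return 0;
    }
    ```

    A production kernel would go further, as the abstract notes: thread coarsening over multiple rows, hierarchical (block-level, then global) aggregation of partial results in place of the simple atomic scatter used here, and a separate code path for sparse matrices.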




Published In

    PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    January 2015, 290 pages
    ISBN: 978-1-4503-3205-7
    DOI: 10.1145/2688500

    Also published in ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2858788
    Editor: Andy Gill

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Dense
    2. Fused Kernel
    3. GPU
    4. Machine Learning
    5. Sparse

    Qualifiers

    • Research-article

Conference

    PPoPP '15

    Acceptance Rates

    Overall Acceptance Rate: 230 of 1,014 submissions, 23%


    Cited By

    • (2023) Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113-1126, February 2023. DOI: 10.1109/HPCA56546.2023.10071018
    • (2022) Spark-Based Deep Learning for Recognition of Epileptic Seizures With Multimodal Brain Waves. In Smart Healthcare for Sustainable Urban Development, pages 158-182, June 2022. DOI: 10.4018/978-1-6684-2508-4.ch012
    • (2022) Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. In Proceedings of the 2022 International Conference on Management of Data, pages 1017-1032, June 2022. DOI: 10.1145/3514221.3517911
    • (2020) Daydream. In Proceedings of the 2020 USENIX Annual Technical Conference, pages 337-352, July 2020. DOI: 10.5555/3489146.3489169
    • (2020) GEVO-ML. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pages 1849-1856, July 2020. DOI: 10.1145/3377929.3398139
    • (2019) From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pages 242-253, February 2019. DOI: 10.5555/3314872.3314901
    • (2019) Data Management in Machine Learning Systems. Synthesis Lectures on Data Management, 14(1):1-173, February 2019. DOI: 10.2200/S00895ED1V01Y201901DTM057
    • (2019) Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent. In Proceedings of the 2019 International Conference on Management of Data, pages 1517-1534, June 2019. DOI: 10.1145/3299869.3300070
    • (2018) On optimizing operator fusion plans for large-scale machine learning in SystemML. Proceedings of the VLDB Endowment, 11(12):1755-1768, August 2018. DOI: 10.14778/3229863.3229865
    • (2018) In-register parameter caching for dynamic neural nets with virtual persistent processor specialization. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pages 377-389, October 2018. DOI: 10.1109/MICRO.2018.00038
