
On optimizing machine learning workloads via kernel fusion

Published: 24 January 2015

Abstract

Exploiting parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such exploitation. Two approaches are commonly pursued: (i) developing GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, which are then used to build ML algorithms. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that are commonly found in popular ML algorithms. We identify the generic pattern of computation alpha * X^T (v ⊙ (X y)) + beta * z, where ⊙ denotes element-wise multiplication, and its various instantiations. We develop a fused kernel to optimize this computation on GPUs, with specialized techniques to handle both sparse and dense matrices. This approach not only reduces the cost of data loads through improved temporal locality but also enables further optimizations such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters. The proposed approach provides speedups ranging from 2x to 67x for different instances of the generic pattern, compared to launching multiple operator-level kernels using GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance of an entire ML algorithm.
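To make the fusion opportunity concrete, the following is a minimal plain-Python sketch (function names are illustrative, not from the paper's implementation) contrasting operator-at-a-time evaluation of the generic pattern alpha * X^T (v ⊙ (X y)) + beta * z with a fused, single-pass evaluation. The unfused version materializes every intermediate and streams the matrix X twice; the fused version loads each row of X once, mirroring the temporal-locality benefit a fused GPU kernel exploits.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def pattern_unfused(X, y, v, z, alpha, beta):
    # Four separate "kernels": each intermediate (t1, t2, t3) is
    # materialized, and X is traversed twice (for X y and for X^T t2).
    t1 = [dot(row, y) for row in X]                      # X y
    t2 = [vi * t for vi, t in zip(v, t1)]                # v (element-wise) t1
    t3 = [dot([row[j] for row in X], t2)                 # X^T t2
          for j in range(len(X[0]))]
    return [alpha * t + beta * zj for t, zj in zip(t3, z)]

def pattern_fused(X, y, v, z, alpha, beta):
    # Fused single pass: row i of X is loaded once and contributes
    # v[i] * (X[i] . y) * X[i] to an accumulator of partial results.
    acc = [0.0] * len(z)
    for i, row in enumerate(X):
        s = v[i] * dot(row, y)
        for j, xij in enumerate(row):
            acc[j] += s * xij
    return [alpha * a + beta * zj for a, zj in zip(acc, z)]
```

On a GPU, the accumulator would be built via coarsening and hierarchical aggregation of per-thread partial results rather than a sequential loop; the sketch only illustrates the data-access pattern that fusion improves.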




Published In

ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages.
ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2858788. Editor: Andy Gill.

Also appears in: PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages. ISBN: 9781450332057. DOI: 10.1145/2688500.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Dense
  2. Fused Kernel
  3. GPU
  4. Machine Learning
  5. Sparse

Qualifiers

  • Research-article

