
On optimizing machine learning workloads via kernel fusion

Published: 24 January 2015


Exploiting parallel architectures has become critical to scalable machine learning (ML). Because a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for this purpose. Two approaches are commonly pursued: (i) developing specific GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, which are then used to build ML algorithms. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that commonly occur in popular ML algorithms. We identify the generic pattern of computation (alpha * X^T (v * (X * y)) + beta * z) and its various instantiations. We develop a fused kernel to optimize this computation on GPUs, with specialized techniques to handle both sparse and dense matrices. This approach not only reduces the cost of data loads through improved temporal locality but also enables other optimizations such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters. The proposed approach provides speedups ranging from 2x to 67x for different instances of the generic pattern compared to launching multiple operator-level kernels using GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance on an entire ML algorithm.
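To make the generic pattern concrete, the following is a minimal CPU-side sketch in plain Python, not the paper's GPU kernel. It assumes that `v * (...)` denotes element-wise multiplication, that X is a dense m-by-n matrix stored as a list of rows, and that v has length m while y and z have length n. The function names `unfused` and `fused` are hypothetical. The first function mirrors the operator-at-a-time approach (one pass per primitive, with temporaries); the second reads each row of X once and uses it for both the X * y and the X^T products, which is the data-reuse idea behind fusion.

```python
def unfused(alpha, X, v, y, beta, z):
    """Operator-at-a-time evaluation: separate passes and two temporaries."""
    m, n = len(X), len(y)
    p = [sum(X[i][j] * y[j] for j in range(n)) for i in range(m)]  # p = X * y
    q = [v[i] * p[i] for i in range(m)]                            # q = v * p (element-wise, assumed)
    w = [sum(X[i][j] * q[i] for i in range(m)) for j in range(n)]  # w = X^T * q (second pass over X)
    return [alpha * w[j] + beta * z[j] for j in range(n)]


def fused(alpha, X, v, y, beta, z):
    """Single pass over the rows of X: each row is read once and used twice."""
    out = [beta * z_j for z_j in z]
    for row, v_i in zip(X, v):
        # Scalar for this row: alpha * v_i * (row . y)
        s = alpha * v_i * sum(x_ij * y_j for x_ij, y_j in zip(row, y))
        # Accumulate this row's contribution to X^T * (v * (X * y))
        for j, x_ij in enumerate(row):
            out[j] += s * x_ij
    return out


X = [[1, 2], [3, 4], [5, 6]]
v = [1, 2, 3]
y = [1, 1]
z = [10, 20]
print(unfused(2, X, v, y, 1, z))  # [430, 540]
print(fused(2, X, v, y, 1, z))    # [430, 540]
```

On a GPU the per-row contributions would be computed by many threads in parallel and combined through some form of aggregation of partial results; this sequential sketch only illustrates that the two formulations compute the same vector.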








Published In

ACM SIGPLAN Notices, Volume 50, Issue 8
PPoPP '15
August 2015
290 pages
Editor: Andy Gill

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015
290 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2015
Published in SIGPLAN Volume 50, Issue 8


Author Tags

  1. Dense
  2. Fused Kernel
  3. GPU
  4. Machine Learning
  5. Sparse


  • Research-article



Cited By

  • (2024) Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 286-301. DOI: 10.1145/3617232.3624858. Online publication date: 27-Apr-2024
  • (2023) NIOT: A Novel Inference Optimization of Transformers on Modern CPUs. IEEE Transactions on Parallel and Distributed Systems, 34(6):1982-1995. DOI: 10.1109/TPDS.2023.3269530. Online publication date: 1-Jun-2023
  • (2022) Collage. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 517-529. DOI: 10.1145/3559009.3569651. Online publication date: 8-Oct-2022
  • (2022) Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework. ACM Transactions on Embedded Computing Systems, 21(5):1-22. DOI: 10.1145/3528578. Online publication date: 20-Apr-2022
  • (2022) FuseME: Distributed Matrix Computation Engine based on Cuboid-based Fused Operator and Plan Generation. Proceedings of the 2022 International Conference on Management of Data, pages 1891-1904. DOI: 10.1145/3514221.3517895. Online publication date: 10-Jun-2022
  • (2022) AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 359-373. DOI: 10.1145/3503222.3507723. Online publication date: 28-Feb-2022
  • (2022) Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration. ACM Transactions on Design Automation of Electronic Systems, 27(5):1-26. DOI: 10.1145/3495532. Online publication date: 24-Feb-2022
  • (2022) Alpinist: An Annotation-Aware GPU Program Optimizer. Tools and Algorithms for the Construction and Analysis of Systems, pages 332-352. DOI: 10.1007/978-3-030-99527-0_18. Online publication date: 30-Mar-2022
  • (2021) Gradient Compression Supercharged High-Performance Data Parallel DNN Training. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pages 359-375. DOI: 10.1145/3477132.3483553. Online publication date: 26-Oct-2021
  • (2021) Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4801-4811. DOI: 10.1109/ICCV48922.2021.00478. Online publication date: Oct-2021
