DOI: 10.1145/2688500.2688521

On optimizing machine learning workloads via kernel fusion

Published: 24 January 2015
  Abstract

    Exploiting parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such exploitation. Two approaches are commonly pursued: (i) developing GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, which are then used to build ML algorithms. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that commonly occur in popular ML algorithms. We identify the generic computation pattern α · Xᵀ (v ⊙ (X y)) + β · z, where ⊙ denotes element-wise multiplication, and its various instantiations. We develop a fused kernel to optimize this computation on GPUs, with specialized techniques to handle both sparse and dense matrices. This approach not only reduces the cost of data loads through improved temporal locality but also enables further optimizations such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters. The proposed approach provides speedups ranging from 2× to 67× for different instances of the generic pattern, compared to launching multiple operator-level kernels from GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance of an entire ML algorithm.
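
    To make the pattern concrete, here is a minimal sketch of a fused CUDA kernel for the dense case. This is not the paper's implementation: the kernel names, launch configuration, and row-major dense layout are assumptions for illustration. The point it demonstrates is the locality benefit of fusion: each thread loads one row of X once and reuses it for both the forward product (X y) and the transposed accumulation Xᵀu, instead of reloading X across separate operator-level kernel launches.

    ```cuda
    // Sketch of a fused kernel for r = alpha * X^T (v .* (X y)) + beta * z,
    // where X is an m x n row-major dense matrix and ".*" is element-wise.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Pre-initialize the output: r = beta * z.
    __global__ void initOutput(const float *z, float beta, float *r, int n) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < n) r[j] = beta * z[j];
    }

    // Fused kernel: t_i = (X y)_i, u_i = v_i * t_i, then r += alpha * X^T u.
    __global__ void fusedPattern(const float *X, const float *y, const float *v,
                                 float alpha, float *r, int m, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= m) return;
        const float *row = X + (size_t)i * n;      // row i of X, loaded once
        float t = 0.0f;
        for (int j = 0; j < n; ++j)                // t_i = dot(X[i,:], y)
            t += row[j] * y[j];
        float u = v[i] * t;                        // u_i = v_i * t_i
        for (int j = 0; j < n; ++j)                // scatter alpha * X[i,:] * u_i
            atomicAdd(&r[j], alpha * row[j] * u);  // into r (r starts at beta*z)
    }

    int main() {
        const int m = 4, n = 3;
        float hX[m * n] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 1};
        float hy[n] = {1, 1, 1}, hv[m] = {1, 2, 0.5f, 1}, hz[n] = {1, 1, 1};
        const float alpha = 1.0f, beta = 2.0f;

        float *dX, *dy, *dv, *dz, *dr;
        cudaMalloc(&dX, sizeof hX);  cudaMalloc(&dy, sizeof hy);
        cudaMalloc(&dv, sizeof hv);  cudaMalloc(&dz, sizeof hz);
        cudaMalloc(&dr, sizeof hz);
        cudaMemcpy(dX, hX, sizeof hX, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, sizeof hy, cudaMemcpyHostToDevice);
        cudaMemcpy(dv, hv, sizeof hv, cudaMemcpyHostToDevice);
        cudaMemcpy(dz, hz, sizeof hz, cudaMemcpyHostToDevice);

        initOutput<<<1, 256>>>(dz, beta, dr, n);
        fusedPattern<<<1, 256>>>(dX, dy, dv, alpha, dr, m, n);

        float hr[n];
        cudaMemcpy(hr, dr, sizeof hr, cudaMemcpyDeviceToHost);
        for (int j = 0; j < n; ++j) printf("r[%d] = %g\n", j, hr[j]);
        return 0;
    }
    ```

    A production kernel would go further, as the abstract notes: thread coarsening over multiple rows, hierarchical (block-level, then global) aggregation of partial results in place of the simple atomic scatter used here, and a separate code path for sparse matrices.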




Published In

    PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    January 2015, 290 pages
    ISBN: 978-1-4503-3205-7
    DOI: 10.1145/2688500

    Also published in ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2858788
    Editor: Andy Gill

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Dense
    2. Fused Kernel
    3. GPU
    4. Machine Learning
    5. Sparse

    Qualifiers

    • Research-article

Conference

    PPoPP '15

    Acceptance Rates

    Overall Acceptance Rate: 230 of 1,014 submissions, 23%


    Cited By

    • (2023) Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113-1126, February 2023. DOI: 10.1109/HPCA56546.2023.10071018
    • (2022) Spark-Based Deep Learning for Recognition of Epileptic Seizures With Multimodal Brain Waves. In Smart Healthcare for Sustainable Urban Development, pages 158-182, June 2022. DOI: 10.4018/978-1-6684-2508-4.ch012
    • (2022) Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. In Proceedings of the 2022 International Conference on Management of Data, pages 1017-1032, June 2022. DOI: 10.1145/3514221.3517911
    • (2020) Daydream. In Proceedings of the 2020 USENIX Annual Technical Conference, pages 337-352, July 2020. DOI: 10.5555/3489146.3489169
    • (2020) GEVO-ML. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pages 1849-1856, July 2020. DOI: 10.1145/3377929.3398139
    • (2019) From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pages 242-253, February 2019. DOI: 10.5555/3314872.3314901
    • (2019) Data Management in Machine Learning Systems. Synthesis Lectures on Data Management, 14(1):1-173, February 2019. DOI: 10.2200/S00895ED1V01Y201901DTM057
    • (2019) Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent. In Proceedings of the 2019 International Conference on Management of Data, pages 1517-1534, June 2019. DOI: 10.1145/3299869.3300070
    • (2018) On optimizing operator fusion plans for large-scale machine learning in SystemML. Proceedings of the VLDB Endowment, 11(12):1755-1768, August 2018. DOI: 10.14778/3229863.3229865
    • (2018) In-register parameter caching for dynamic neural nets with virtual persistent processor specialization. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pages 377-389, October 2018. DOI: 10.1109/MICRO.2018.00038
