Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs

Published: 25 October 2016

Abstract

    The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computing. GPU-accelerated PCG algorithms for large problems have recently attracted considerable attention. However, on a given multi-GPU platform, producing a highly parallel PCG implementation for an arbitrary large problem requires significant effort, because several manual steps are involved in tuning the relevant parameters and selecting an appropriate storage format for the matrix block assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which comprises two main parts: (1) an optimized multi-GPU parallel framework for PCG and (2) profile-based optimization models for each of the main components of the PCG algorithm: vector operations, inner products, and sparse matrix-vector multiplication (SpMV). Our model constructs no new storage format or kernel; instead, it automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a given multi-GPU platform by integrating existing storage formats and kernels. We use a vector-operation kernel, an inner-product kernel, and five popular SpMV kernels as an example to present the idea behind the model. Because the model is general, independent of the problem, and dependent only on the resources of the device, it is constructed only once for each type of GPU. Experiments validate the high efficiency of the proposed model.
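    To make the three kernel classes the abstract names concrete (SpMV, inner products, and vector updates), here is a minimal single-threaded sketch of PCG on a CSR matrix with a Jacobi (diagonal) preconditioner. The CSR layout, the Jacobi choice, and the helper names `spmv`, `dot`, and `axpy` are illustrative assumptions, not the paper's actual multi-GPU code; on GPUs each of these loops would be a separate kernel whose parameters and storage format the paper's model selects.

    ```python
    def spmv(n, row_ptr, col_idx, vals, x):
        """Sparse matrix-vector product y = A*x for an n-row CSR matrix."""
        y = [0.0] * n
        for i in range(n):
            s = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                s += vals[k] * x[col_idx[k]]
            y[i] = s
        return y

    def dot(a, b):
        """Inner product of two vectors."""
        return sum(ai * bi for ai, bi in zip(a, b))

    def axpy(alpha, x, y):
        """Vector update: return y + alpha*x."""
        return [yi + alpha * xi for xi, yi in zip(x, y)]

    def pcg(n, row_ptr, col_idx, vals, b, tol=1e-10, max_iter=100):
        """Solve A x = b for symmetric positive-definite A in CSR form,
        preconditioned by the diagonal of A (Jacobi)."""
        diag = [0.0] * n
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if col_idx[k] == i:
                    diag[i] = vals[k]
        x = [0.0] * n
        r = list(b)                                   # r = b - A*x0 with x0 = 0
        z = [ri / di for ri, di in zip(r, diag)]      # z = M^{-1} r
        p = list(z)
        rz = dot(r, z)
        for _ in range(max_iter):
            Ap = spmv(n, row_ptr, col_idx, vals, p)   # the dominant SpMV kernel
            alpha = rz / dot(p, Ap)
            x = axpy(alpha, p, x)
            r = axpy(-alpha, Ap, r)
            if dot(r, r) ** 0.5 < tol:                # residual-norm stopping test
                break
            z = [ri / di for ri, di in zip(r, diag)]
            rz_new = dot(r, z)
            p = axpy(rz_new / rz, p, z)               # p = z + beta*p
            rz = rz_new
        return x
    ```

    For example, solving the 2-by-2 system with A = [[4, 1], [1, 3]] and b = [1, 2] yields x ≈ [1/11, 7/11]. In the multi-GPU setting described above, the matrix is partitioned into row blocks, and the model picks a (possibly different) SpMV format and launch configuration per block.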


    Cited By

    • (2024) Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3676847. Online publication date: 8 July 2024.
    • (2021) Efficient Concurrent L1-Minimization Solvers on GPUs. Computer Systems Science and Engineering 38, 3, 305–320. DOI: 10.32604/csse.2021.017144.
    • (2019) hpSpMV: A Heterogeneous Parallel Computing Scheme for SpMV on the Sunway TaihuLight Supercomputer. In Proc. 2019 IEEE HPCC/SmartCity/DSS, 989–995. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00142.
    • (2019) A Parallel Solving Algorithm on GPU for the Time-Domain Linear System with Diagonal Sparse Matrices. In Big Scientific Data Benchmarks, Architecture, and Systems, 73–84. DOI: 10.1007/978-981-13-5910-1_7.
    • (2018) GPU-accelerated preconditioned GMRES method for two-dimensional Maxwell's equations. International Journal of Computer Mathematics 94, 10, 2122–2144. DOI: 10.1080/00207160.2017.1280156.

    Published In

    ACM Transactions on Parallel Computing, Volume 3, Issue 3
    December 2016
    145 pages
    ISSN: 2329-4949
    EISSN: 2329-4957
    DOI: 10.1145/3012407

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 October 2016
    Accepted: 01 August 2016
    Revised: 01 August 2016
    Received: 01 December 2015
    Published in TOPC Volume 3, Issue 3


    Author Tags

    1. CUDA
    2. Optimization model
    3. multiple GPUs
    4. preconditioned conjugate gradient
    5. sparse matrix-vector multiplication

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Science Foundation of China

