Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs

Published: 25 October 2016

Abstract

    The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computing. GPU-accelerated PCG algorithms for large problems have recently attracted considerable attention. However, on a given multi-GPU platform, producing a highly parallel PCG implementation for an arbitrary large problem requires significant effort, because several manual steps are involved in tuning the relevant parameters and selecting an appropriate storage format for the matrix block assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which comprises two main parts: (1) an optimized multi-GPU parallel framework for PCG and (2) profile-based optimization models for each of the main components of the PCG algorithm: vector operations, inner products, and sparse matrix-vector multiplication (SpMV). Our model constructs no new storage format or kernel; instead, it automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a given multi-GPU platform by integrating existing storage formats and kernels. We use a vector-operation kernel, an inner-product kernel, and five popular SpMV kernels as an example to present the idea behind the model. Because the model is general, independent of the problem, and dependent only on the resources of the device, it is constructed only once for each type of GPU. Experiments validate the high efficiency of the proposed model.
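    To make the three kernel classes the abstract names concrete (SpMV, inner products, and vector updates), here is a minimal single-threaded sketch of PCG on a CSR matrix with a Jacobi (diagonal) preconditioner. The CSR layout, the Jacobi choice, and the helper names `spmv`, `dot`, and `axpy` are illustrative assumptions, not the paper's actual multi-GPU code; on GPUs each of these loops would be a separate kernel whose parameters and storage format the paper's model selects.

    ```python
    def spmv(n, row_ptr, col_idx, vals, x):
        """Sparse matrix-vector product y = A*x for an n-row CSR matrix."""
        y = [0.0] * n
        for i in range(n):
            s = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                s += vals[k] * x[col_idx[k]]
            y[i] = s
        return y

    def dot(a, b):
        """Inner product of two vectors."""
        return sum(ai * bi for ai, bi in zip(a, b))

    def axpy(alpha, x, y):
        """Vector update: return y + alpha*x."""
        return [yi + alpha * xi for xi, yi in zip(x, y)]

    def pcg(n, row_ptr, col_idx, vals, b, tol=1e-10, max_iter=100):
        """Solve A x = b for symmetric positive-definite A in CSR form,
        preconditioned by the diagonal of A (Jacobi)."""
        diag = [0.0] * n
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if col_idx[k] == i:
                    diag[i] = vals[k]
        x = [0.0] * n
        r = list(b)                                   # r = b - A*x0 with x0 = 0
        z = [ri / di for ri, di in zip(r, diag)]      # z = M^{-1} r
        p = list(z)
        rz = dot(r, z)
        for _ in range(max_iter):
            Ap = spmv(n, row_ptr, col_idx, vals, p)   # the dominant SpMV kernel
            alpha = rz / dot(p, Ap)
            x = axpy(alpha, p, x)
            r = axpy(-alpha, Ap, r)
            if dot(r, r) ** 0.5 < tol:                # residual-norm stopping test
                break
            z = [ri / di for ri, di in zip(r, diag)]
            rz_new = dot(r, z)
            p = axpy(rz_new / rz, p, z)               # p = z + beta*p
            rz = rz_new
        return x
    ```

    For example, solving the 2-by-2 system with A = [[4, 1], [1, 3]] and b = [1, 2] yields x ≈ [1/11, 7/11]. In the multi-GPU setting described above, the matrix is partitioned into row blocks, and the model picks a (possibly different) SpMV format and launch configuration per block.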


    Cited By

    • (2024) Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3676847. Online publication date: 8 July 2024.
    • (2021) Efficient Concurrent L1-Minimization Solvers on GPUs. Computer Systems Science and Engineering 38, 3, 305–320. DOI: 10.32604/csse.2021.017144.
    • (2019) hpSpMV: A Heterogeneous Parallel Computing Scheme for SpMV on the Sunway TaihuLight Supercomputer. In Proc. 2019 IEEE HPCC/SmartCity/DSS, 989–995. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00142.
    • (2019) A Parallel Solving Algorithm on GPU for the Time-Domain Linear System with Diagonal Sparse Matrices. In Big Scientific Data Benchmarks, Architecture, and Systems, 73–84. DOI: 10.1007/978-981-13-5910-1_7.
    • (2018) GPU-accelerated preconditioned GMRES method for two-dimensional Maxwell's equations. International Journal of Computer Mathematics 94, 10, 2122–2144. DOI: 10.1080/00207160.2017.1280156.

    Published In

    ACM Transactions on Parallel Computing, Volume 3, Issue 3
    December 2016
    145 pages
    ISSN: 2329-4949
    EISSN: 2329-4957
    DOI: 10.1145/3012407

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 October 2016
    Accepted: 01 August 2016
    Revised: 01 August 2016
    Received: 01 December 2015
    Published in TOPC Volume 3, Issue 3


    Author Tags

    1. CUDA
    2. Optimization model
    3. multiple GPUs
    4. preconditioned conjugate gradient
    5. sparse matrix-vector multiplication

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Science Foundation of China

