GPU-accelerated preconditioned iterative linear solvers

Authors:

Yousef Saad

The Journal of Supercomputing, Volume 63, Issue 2

Pages 443 - 466

https://doi.org/10.1007/s11227-012-0825-3

Published: 01 February 2013

Abstract

This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA Tesla M2070 show that, for unstructured matrices, SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 processor. The overall gain of the GPU-accelerated incomplete Cholesky (IC) factorization preconditioned CG method over its CPU counterpart is smaller, up to a factor of 3, and the GPU-accelerated incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with preconditioning techniques better suited to GPUs, this performance can be further improved.
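To make the two kernels named in the abstract concrete, here is a minimal CPU-side sketch in plain Python of a sparse matrix-vector product in compressed sparse row (CSR) storage, together with a preconditioned CG loop built on it. It is not the authors' implementation. The function names (`csr_spmv`, `jacobi_pcg`), the choice of CSR, and the tolerance are assumptions made for illustration. The preconditioner is a simple diagonal (Jacobi) scaling, swapped in for the incomplete Cholesky factorization the abstract refers to, because it keeps the example short.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Return y = A @ x for a matrix A held in CSR arrays."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Rows are independent of each other, which is what a GPU
        # kernel exploits (e.g. one thread or one warp per row).
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A with CG.

    Assumption: the preconditioner is M = diag(A), a stand-in for
    the IC preconditioner mentioned in the abstract.
    Returns (x, number_of_iterations).
    """
    n = len(b)
    inv_diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                inv_diag[i] = 1.0 / vals[k]

    x = [0.0] * n
    r = list(b)                                  # r = b - A*0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5

    for it in range(max_iter):
        # Stop once the relative residual norm is small enough.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        Ap = csr_spmv(row_ptr, col_idx, vals, p)  # the SpMV step
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical test problem: the 4x4 tridiagonal matrix with 2 on the
# diagonal and -1 on the off-diagonals (a 1D Laplacian).
demo_row_ptr = [0, 2, 5, 8, 10]
demo_col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
demo_vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
demo_b = csr_spmv(demo_row_ptr, demo_col_idx, demo_vals, [1.0, 2.0, 3.0, 4.0])
demo_x, demo_iters = jacobi_pcg(demo_row_ptr, demo_col_idx, demo_vals, demo_b)
print(demo_x, demo_iters)
```

In this sketch the SpMV and the vector updates are the parts that map naturally onto a GPU. A diagonal preconditioner is trivially parallel, whereas IC and ILU preconditioners need sparse triangular solves that are largely sequential, which is one plausible reading of why the abstract reports a smaller speed-up for the full solver than for SpMV alone.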

Index Terms

GPU-accelerated preconditioned iterative linear solvers
1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Index terms have been assigned to the content through auto-classification.

Published In

The Journal of Supercomputing, Volume 63, Issue 2

February 2013

313 pages

ISSN: 0920-8542

Copyright © 2013 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States
