Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units

Authors: Ansgar Jüngel, Tibor Grasser

ACM Transactions on Mathematical Software (TOMS), Volume 43, Issue 2

Article No.: 11, Pages 1 - 27

https://doi.org/10.1145/2907944

Published: 16 August 2016

Abstract

We revisit the implementation of iterative solvers on discrete graphics processing units and demonstrate the benefit of implementations using extensive kernel fusion for pipelined formulations over conventional implementations of classical formulations. The proposed implementations with both CUDA and OpenCL are freely available in ViennaCL and are shown to be competitive with or even superior to other solver packages for graphics processing units. The highest performance gains are obtained for small to medium-sized systems, while our implementations are on par with vendor-tuned implementations for very large systems. Our results are especially beneficial for transient problems, where many small to medium-sized systems instead of a single big system need to be solved.
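
To make the term "kernel fusion" concrete, here is a minimal sketch. It is an illustration only, not the ViennaCL implementation: the function names `unfused_update_dot` and `fused_update_dot` are hypothetical, and plain Python loops stand in for GPU kernels, with each loop over the vector modeling one kernel launch and one pass over device memory. The example takes a residual update followed by an inner product, two operations that recur in conjugate-gradient-type iterations, and computes them first in two passes and then in a single pass.

```python
def unfused_update_dot(r, Ap, alpha):
    """Unfused variant: two separate passes over the data.

    Returns the updated vector, its squared norm, and the number of
    passes (a stand-in for the number of kernel launches).
    """
    # Pass 1 (models kernel launch 1): r_new = r - alpha * Ap
    r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
    # Pass 2 (models kernel launch 2): the inner product re-reads r_new
    rr = sum(ri * ri for ri in r_new)
    return r_new, rr, 2


def fused_update_dot(r, Ap, alpha):
    """Fused variant: update and inner product share a single pass."""
    r_new = []
    rr = 0.0
    for ri, api in zip(r, Ap):
        v = ri - alpha * api  # vector update
        r_new.append(v)
        rr += v * v           # accumulate the inner product while v is at hand
    return r_new, rr, 1


if __name__ == "__main__":
    r = [1.0, 2.0, 3.0]
    Ap = [0.5, 0.5, 0.5]
    print(unfused_update_dot(r, Ap, 2.0))
    print(fused_update_dot(r, Ap, 2.0))
```

Both variants return the same vector and the same inner product; the fused one does so with one pass instead of two. On a GPU, fewer passes mean fewer kernel launches and less memory traffic, and the fixed cost of a launch weighs most heavily when the vectors are short, which is consistent with the abstract's observation that the gains are largest for small to medium-sized systems.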

Index Terms

Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units
1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Published In

ACM Transactions on Mathematical Software Volume 43, Issue 2

June 2017

200 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/2988256

Editor:
Michael A. Heroux
Sandia National Laboratories, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2016

Accepted: 01 March 2016

Revised: 01 October 2015

Received: 01 December 2014

Published in TOMS Volume 43, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Austrian Science Fund (FWF)

Cited By

Kronbichler M, Sashko D, Munch P (2023). Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations. International Journal of High Performance Computing Applications 37:2 (61-81). Online publication date: 1-Mar-2023.
https://dl.acm.org/doi/10.1177/10943420221107880
Vargas A, Stitt T, Weiss K, Tomov V, Camier J, Kolev T, Rieben R (2022). Matrix-free approaches for GPU acceleration of a high-order finite element hydrodynamics application using MFEM, Umpire, and RAJA. International Journal of High Performance Computing Applications 36:4 (492-509). Online publication date: 1-Jul-2022.
https://dl.acm.org/doi/10.1177/10943420221100262
Basic J, Blagojevic B, Basic M, Sikora M (2021). Parallelism and Iterative bi-Lanczos Solvers. 2021 6th International Conference on Smart and Sustainable Technologies (SpliTech) (1-6). Online publication date: 8-Sep-2021.
https://doi.org/10.23919/SpliTech52315.2021.9566330
Korch M, Werner T (2020). Improving locality of explicit one-step methods on GPUs by tiling across stages and time steps. Future Generation Computer Systems 102:C (889-901). Online publication date: 1-Jan-2020.
https://dl.acm.org/doi/10.1016/j.future.2019.07.075
Kiran U, Gautam S, Sharma D (2020). GPU-based matrix-free finite element solver exploiting symmetry of elemental matrices. Computing 102:9 (1941-1965). Online publication date: 1-Sep-2020.
https://dl.acm.org/doi/10.1007/s00607-020-00827-4
Krasnopolsky B (2019). Revisiting performance of BiCGStab methods for solving systems with multiple right-hand sides. Computers & Mathematics with Applications. Online publication date: Dec-2019.
https://doi.org/10.1016/j.camwa.2019.11.025
Korch M, Werner T (2018). Exploiting Limited Access Distance for Kernel Fusion Across the Stages of Explicit One-Step Methods on GPUs. 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (148-157). Online publication date: Sep-2018.
https://doi.org/10.1109/CAHPC.2018.8645892
Lamas Daviña A, Roman J (2018). MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc’s eigensolvers. Parallel Computing 74:C (118-135). Online publication date: 1-May-2018.
https://dl.acm.org/doi/10.1016/j.parco.2017.11.006
Anzt H, Gates M, Dongarra J, Kreutzer M, Wellein G, Köhler M (2017). Preconditioned Krylov solvers on GPUs. Parallel Computing 68:C (32-44). Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.1016/j.parco.2017.05.006
Kreutzer M, Thies J, Röhrig-Zöllner M, Pieper A, Shahzad F, Galgon M, Basermann A, Fehske H, Hager G, Wellein G (2017). GHOST. International Journal of Parallel Programming 45:5 (1046-1072). Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.1007/s10766-016-0464-z
