
Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units

Published: 16 August 2016

Abstract

We revisit the implementation of iterative solvers on discrete graphics processing units and demonstrate the benefit of implementations using extensive kernel fusion for pipelined formulations over conventional implementations of classical formulations. The proposed implementations with both CUDA and OpenCL are freely available in ViennaCL and are shown to be competitive with or even superior to other solver packages for graphics processing units. The highest performance gains are obtained for small to medium-sized systems, while our implementations are on par with vendor-tuned implementations for very large systems. Our results are especially beneficial for transient problems, where many small to medium-sized systems instead of a single big system need to be solved.
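To make the idea of kernel fusion in a pipelined formulation more concrete, the following is a minimal sketch in plain Python, where each function stands in for one GPU kernel launch. It uses a well-known pipelined reformulation of the conjugate gradient method in the spirit of Chronopoulos and Gear, which may differ in detail from the formulation used in the article. The function names (`spmv_and_dots`, `fused_vector_update`, `pipelined_cg`) and the row-wise sparse storage are assumptions made for illustration and are not part of the ViennaCL API.

```python
# Illustrative sketch only: plain-Python loops stand in for GPU kernels.
# All names below are hypothetical and not taken from ViennaCL.

def spmv_and_dots(A, p, r):
    """Stand-in for one fused kernel: compute Ap and, in the same pass,
    the inner products <Ap,Ap>, <p,Ap> and <r,r>."""
    n = len(p)
    Ap = [0.0] * n
    ApAp = pAp = rr = 0.0
    for i in range(n):
        s = 0.0
        for j, a_ij in A[i]:      # A is a list of rows of (column, value) pairs
            s += a_ij * p[j]
        Ap[i] = s
        ApAp += s * s
        pAp += p[i] * s
        rr += r[i] * r[i]
    return Ap, ApAp, pAp, rr


def fused_vector_update(x, r, p, Ap, alpha, beta):
    """Stand-in for a second fused kernel: update x, r and p in one pass."""
    for i in range(len(x)):
        x[i] += alpha * p[i]      # uses the old search direction p
        r[i] -= alpha * Ap[i]
        p[i] = r[i] + beta * p[i]


def pipelined_cg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, starting from x = 0.
    Returns the approximate solution and the number of iterations."""
    x = [0.0] * len(b)
    r = list(b)                   # r0 = b - A*x0 with x0 = 0
    p = list(b)
    Ap, ApAp, pAp, rr = spmv_and_dots(A, p, r)
    for it in range(max_iter):
        if rr ** 0.5 <= tol:      # norm of the recursively updated residual
            return x, it
        alpha = rr / pAp
        # beta from already available inner products:
        # <r_new,r_new> = alpha^2 <Ap,Ap> - <r,r> in exact arithmetic
        beta = alpha * alpha * ApAp / rr - 1.0
        fused_vector_update(x, r, p, Ap, alpha, beta)
        Ap, ApAp, pAp, rr = spmv_and_dots(A, p, r)
    return x, max_iter


if __name__ == "__main__":
    # Small symmetric positive definite example system.
    A = [[(0, 4.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
    x, iterations = pipelined_cg(A, [1.0, 2.0])
    print(x, iterations)
```

In this sketch each iteration needs only two passes over the vectors, and all three inner products become available at a single point. On a GPU this would correspond to fewer kernel launches and fewer host-device synchronizations per iteration, which is plausibly why the gains reported above are largest for small to medium-sized systems, where such fixed overheads weigh most.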

Information

Published In

ACM Transactions on Mathematical Software, Volume 43, Issue 2
June 2017
200 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/2988256
Editor: Michael A. Heroux
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2016
Accepted: 01 March 2016
Revised: 01 October 2015
Received: 01 December 2014
Published in TOMS Volume 43, Issue 2

Author Tags

  1. BiCGStab method
  2. CUDA
  3. GMRES method
  4. GPU
  5. Iterative solvers
  6. OpenCL
  7. conjugate gradient method

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Austrian Science Fund (FWF)
