research-article

Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs

Authors:

Moritz Kreutzer,

Gregory D Peterson,

Gerhard Wellein,

Jack Dongarra

International Journal of High Performance Computing Applications, Volume 32, Issue 2

Pages 220 - 230

https://doi.org/10.1177/1094342016646844

Published: 01 March 2018

Abstract

In this paper, we present an optimized GPU implementation of the induced dimension reduction (IDR) algorithm. We improve data locality, combine the implementation with an efficient sparse matrix-vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals an efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound.
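The abstract does not say which performance model is used, so the following is only a minimal sketch of how an efficiency figure relative to an attainable bound could be computed. It assumes a roofline-style bound for a memory-bound kernel. The function names and all hardware numbers are hypothetical and are not taken from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Roofline-style bound (assumed model): the kernel is limited either by
    the compute peak or by memory bandwidth times arithmetic intensity."""
    intensity = flops / bytes_moved  # flop per byte of memory traffic
    return min(peak_gflops, bandwidth_gbs * intensity)


def efficiency(measured_gflops, bound_gflops):
    """Fraction of the attainable bound that a measurement reaches."""
    return measured_gflops / bound_gflops


# Hypothetical example: a double-precision axpy (y = a*x + y) performs
# 2 flops per element and moves 24 bytes (read x, read y, write y).
# The peak, bandwidth, and measured values below are made up for illustration.
bound = attainable_gflops(peak_gflops=1000.0, bandwidth_gbs=200.0,
                          flops=2, bytes_moved=24)
eff = efficiency(measured_gflops=15.0, bound_gflops=bound)
print(f"bound = {bound:.2f} GFLOP/s, efficiency = {eff:.0%}")
```

Under such a model, an efficiency close to 1 would mean the measured rate is close to what the memory system allows, which is the sense in which the abstract relates its 90% figure to the attainable bound.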

Cited By

Hijma P, Heldens S, Sclocco A, van Werkhoven B, Bal H (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023.
https://dl.acm.org/doi/10.1145/3570638
Anzt H, Dongarra J, Flegar G, Quintana-Ortí E (2017) Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs. Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores (1-10). DOI: 10.1145/3026937.3026940. Online publication date: 4-Feb-2017.
https://dl.acm.org/doi/10.1145/3026937.3026940

Copyright © The Authors 2016.

Publisher

Sage Publications, Inc.

United States
