research-article

Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs

Published: 01 March 2018

Abstract

In this paper, we present an optimized GPU implementation of the induced dimension reduction (IDR) algorithm. We improve data locality, combine the solver with an efficient sparse matrix-vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals an efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound.
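To illustrate what "efficiency relative to the theoretically attainable bound" can mean, the following is a minimal sketch of a roofline-style bound, the model named in the author tags. It is not the authors' performance model: the function names, the hardware figures, and the per-nonzero traffic estimate for a CSR sparse matrix-vector product are all assumptions chosen for illustration.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Attainable GFLOP/s under a simple roofline model.

    The bound is the smaller of the compute peak and the memory
    bandwidth multiplied by the arithmetic intensity (flop per byte).
    """
    intensity = flops / bytes_moved
    return min(peak_gflops, bandwidth_gbs * intensity)


def efficiency(measured_gflops, bound_gflops):
    """Fraction of the attainable bound that a measurement reaches."""
    return measured_gflops / bound_gflops


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
# Assumed traffic for double-precision CSR SpMV, per nonzero:
# 2 flops (multiply + add), 8 bytes for the value, 4 bytes for the
# column index. Vector and row-pointer traffic is ignored here.
bound = roofline_bound(peak_gflops=1000.0, bandwidth_gbs=200.0,
                       flops=2.0, bytes_moved=12.0)

# A hypothetical measured rate of 30 GFLOP/s against that bound.
eff = efficiency(30.0, bound)
print(f"bound = {bound:.2f} GFLOP/s, efficiency = {eff:.0%}")
```

With these assumed numbers the kernel is bandwidth-bound, so the attainable rate is set by memory traffic rather than by the compute peak, which is the typical situation for sparse iterative solvers.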


Cited By

  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023
  • (2017) Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs. Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores (1-10). DOI: 10.1145/3026937.3026940. Online publication date: 4-Feb-2017

Published In

International Journal of High Performance Computing Applications, Volume 32, Issue 2
March 2018
107 pages

Publisher

Sage Publications, Inc.

United States

Author Tags

  1. GPU
  2. Induced dimension reduction (IDR)
  3. co-design
  4. kernel fusion
  5. kernel overlap
  6. roofline performance model
