
Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

Published: 01 November 2014

Abstract

Programmers usually implement iterative methods for solving partial differential equations as a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The overall runtime of the resulting sequence is often penalized by the smallest and least efficient vector operations. To make better use of the GPU, we identify and analyze the kernels that can potentially be fused, taking into account data dependences, data type and size, and GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of the Basic Linear Algebra Subprograms (BLAS)] on GPU performance. The experimental evaluation shows that this optimization yields a noticeable improvement, especially for kernels with lower memory requirements and on more modern GPUs. Fused BLAS operations can help programmers code efficient iterative methods for solving large linear systems of equations on the GPU. The biconjugate gradient method (BCG) is one example of an iterative method that benefits from this optimization strategy: kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between $1.09\times$ and $1.27\times$ faster on three GPUs with different characteristics.
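To make the idea of fusing level-1 BLAS operations concrete, the following is a minimal sketch in plain Python rather than CUDA. It is not the paper's implementation: the function names (`axpy`, `dot`, `fused_axpy_dot`) and the traffic model are illustrative assumptions. The model counts one memory access per vector element read or written and ignores caches and the scalar result of the reduction. The unfused version makes two passes over the data, as two separate library kernels would, while the fused version computes the same update and dot product in a single pass.

```python
def axpy(alpha, x, y):
    """Unfused kernel 1: return alpha*x + y (reads x and y, writes the result)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(u, v):
    """Unfused kernel 2: return the dot product of u and v (reads u and v)."""
    acc = 0.0
    for ui, vi in zip(u, v):
        acc += ui * vi
    return acc


def fused_axpy_dot(alpha, x, y, z):
    """Fused kernel (hypothetical): one pass computes y_new = alpha*x + y
    and dot(y_new, z), so y_new is never re-read from memory."""
    y_new = []
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        t = alpha * xi + yi   # axpy element, kept in a local variable
        y_new.append(t)
        acc += t * zi         # reuse t immediately for the reduction
    return y_new, acc


def traffic_unfused(n):
    """Assumed model: axpy touches 3n elements, dot touches 2n elements."""
    return 3 * n + 2 * n


def traffic_fused(n):
    """Assumed model: read x, y, z and write y_new, 4n elements in total."""
    return 4 * n


if __name__ == "__main__":
    x, y, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [1.0, 1.0, 2.0]
    y_sep = axpy(2.0, x, y)
    print(y_sep, dot(y_sep, z))
    print(fused_axpy_dot(2.0, x, y, z))
    print(traffic_unfused(len(x)), traffic_fused(len(x)))
```

Under this simplified model the fused version moves 4n instead of 5n elements and needs one kernel launch instead of two, which is consistent with the abstract's observation that the benefit is largest for small, memory-light vector kernels. Actual gains on a GPU depend on the hardware and on how the reduction is implemented.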



Published In

The Journal of Supercomputing, Volume 70, Issue 2
November 2014
512 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. BCG
  2. BLAS 1
  3. GPU
  4. Iterative solvers
  5. Kernel fusion

Qualifiers

  • Article

Cited By

  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023
  • (2022) FIST-HOSVD. Proceedings of the Platform for Advanced Scientific Computing Conference (1-11). DOI: 10.1145/3539781.3539798. Online publication date: 27-Jun-2022
  • (2022) Real-time optical flow processing on embedded GPU: an hardware-aware algorithm to implementation strategy. Journal of Real-Time Image Processing 19:2 (317-329). DOI: 10.1007/s11554-021-01187-8. Online publication date: 1-Apr-2022
  • (2019) Accelerating the task/data-parallel version of ILUPACK’s BiCG in multi-CPU/GPU configurations. Parallel Computing 85:C (79-87). DOI: 10.1016/j.parco.2019.02.005. Online publication date: 1-Jul-2019
  • (2018) Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs. International Journal of High Performance Computing Applications 32:2 (220-230). DOI: 10.1177/1094342016646844. Online publication date: 1-Mar-2018
  • (2018) A tuning approach for iterative multiple 3d stencil pipeline on GPUs. The Journal of Supercomputing 74:4 (1580-1608). DOI: 10.1007/s11227-017-2184-6. Online publication date: 1-Apr-2018
  • (2017) LogCA. ACM SIGARCH Computer Architecture News 45:2 (375-388). DOI: 10.1145/3140659.3080216. Online publication date: 24-Jun-2017
  • (2017) LogCA. Proceedings of the 44th Annual International Symposium on Computer Architecture (375-388). DOI: 10.1145/3079856.3080216. Online publication date: 24-Jun-2017
  • (2017) GHOST. International Journal of Parallel Programming 45:5 (1046-1072). DOI: 10.1007/s10766-016-0464-z. Online publication date: 1-Oct-2017
  • (2015) GPU-accelerated co-design of induced dimension reduction. Proceedings of the 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing (1-8). DOI: 10.1145/2834899.2834907. Online publication date: 15-Nov-2015
