Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

Authors:

E. M. Garzón

The Journal of Supercomputing, Volume 70, Issue 2

Pages 577 - 587

https://doi.org/10.1007/s11227-014-1102-4

Published: 01 November 2014

Abstract

Programmers usually implement iterative methods for solving partial differential equations as a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The overall runtime of the resulting combination is often penalized by the smallest and least efficient vector operations. To improve GPU utilization, we identify and analyze the kernels that are candidates for fusion according to their data dependences, data type and size, and GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of the Basic Linear Algebra Subprograms (BLAS)] on GPU performance. The experimental evaluation shows that this optimization provides a noticeable improvement, especially for kernels with lower memory requirements and on more modern GPUs. Fused BLAS operations can help programmers efficiently code iterative methods on the GPU for solving large linear systems of equations. The biconjugate gradient method (BCG) is one example of an iterative method that can benefit from this optimization strategy: kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between 1.09× and 1.27× faster on three GPUs with different characteristics.
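As a rough illustration of the idea, and not the authors' implementation, the following C++ sketch fuses two level-1 BLAS operations, an AXPY update followed by a dot product, into a single pass over the data. It is a CPU analogue in which each loop stands in for one GPU kernel launch. The function names (`axpy`, `dot`, `fused_axpy_dot`) and the choice of this particular pair of operations are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Unfused version: two separate passes, as if calling two library routines.
// On a GPU each function would correspond to its own kernel launch, and y
// would be written to global memory by the first and re-read by the second.

// y <- y + alpha * x
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] += alpha * x[i];
    }
}

// returns sum_i a[i] * b[i]
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Fused version (hypothetical helper): updates y and accumulates dot(y, z)
// in the same pass, so each updated element is reused immediately instead
// of being loaded again by a second routine.
double fused_axpy_dot(double alpha,
                      const std::vector<double>& x,
                      std::vector<double>& y,
                      const std::vector<double>& z) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double yi = y[i] + alpha * x[i];  // AXPY step
        y[i] = yi;
        sum += yi * z[i];                       // dot-product step on the fresh value
    }
    return sum;
}
```

In the unfused version the updated vector `y` is written out and then read back by the second pass. The fused version uses each updated element while it is still at hand, which is the kind of memory-traffic saving that kernel fusion targets on a GPU, in addition to issuing one launch instead of two.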

Published In

The Journal of Supercomputing Volume 70, Issue 2

November 2014

512 pages

ISSN:0920-8542

Copyright © 2014 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States

Cited By

Hijma P, Heldens S, Sclocco A, van Werkhoven B, Bal H (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55(11):1-81. doi:10.1145/3570638. Online publication date: 16-Mar-2023
https://dl.acm.org/doi/10.1145/3570638
Cobb B, Kolla H, Phipps E, Çatalyürek Ü, Robinson T (2022) FIST-HOSVD. Proceedings of the Platform for Advanced Scientific Computing Conference, pp 1-11. doi:10.1145/3539781.3539798. Online publication date: 27-Jun-2022
https://dl.acm.org/doi/10.1145/3539781.3539798
Seznec M, Gac N, Orieux F, Naik A (2022) Real-time optical flow processing on embedded GPU: an hardware-aware algorithm to implementation strategy. Journal of Real-Time Image Processing 19(2):317-329. doi:10.1007/s11554-021-01187-8. Online publication date: 1-Apr-2022
https://dl.acm.org/doi/10.1007/s11554-021-01187-8
Aliaga J, Dufrechou E, Ezzatti P, Quintana-Ortí E (2019) Accelerating the task/data-parallel version of ILUPACK’s BiCG in multi-CPU/GPU configurations. Parallel Computing 85(C):79-87. doi:10.1016/j.parco.2019.02.005. Online publication date: 1-Jul-2019
https://dl.acm.org/doi/10.1016/j.parco.2019.02.005
Anzt H, Kreutzer M, Ponce E, Peterson G, Wellein G, Dongarra J (2018) Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs. International Journal of High Performance Computing Applications 32(2):220-230. doi:10.1177/1094342016646844. Online publication date: 1-Mar-2018
https://dl.acm.org/doi/10.1177/1094342016646844
Tabik S, Peemen M, Romero L (2018) A tuning approach for iterative multiple 3d stencil pipeline on GPUs. The Journal of Supercomputing 74(4):1580-1608. doi:10.1007/s11227-017-2184-6. Online publication date: 1-Apr-2018
https://dl.acm.org/doi/10.1007/s11227-017-2184-6
Altaf M, Wood D (2017) LogCA. ACM SIGARCH Computer Architecture News 45(2):375-388. doi:10.1145/3140659.3080216. Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3140659.3080216
Altaf M, Wood D (2017) LogCA. Proceedings of the 44th Annual International Symposium on Computer Architecture, pp 375-388. doi:10.1145/3079856.3080216. Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3079856.3080216
Kreutzer M, Thies J, Röhrig-Zöllner M, Pieper A, Shahzad F, Galgon M, Basermann A, Fehske H, Hager G, Wellein G (2017) GHOST. International Journal of Parallel Programming 45(5):1046-1072. doi:10.1007/s10766-016-0464-z. Online publication date: 1-Oct-2017
https://dl.acm.org/doi/10.1007/s10766-016-0464-z
Anzt H, Ponce E, Peterson G, Dongarra J (2015) GPU-accelerated co-design of induced dimension reduction. Proceedings of the 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing, pp 1-8. doi:10.1145/2834899.2834907. Online publication date: 15-Nov-2015
https://dl.acm.org/doi/10.1145/2834899.2834907
