DOI: 10.1007/978-3-642-36949-0_23

Systematic approach in optimizing numerical memory-bound kernels on GPU

Published: 27 August 2012

Abstract

The use of GPUs has been very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high-performance numerical libraries, such as CUBLAS, MAGMA, and CULA, provide BLAS and LAPACK implementations on GPUs, as well as hybrid computations involving both CPUs and GPUs. GPUs usually achieve higher performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs by taking advantage of the underlying device's architecture (e.g., its high memory throughput). In recent work (Abdelfattah et al., VECPAR 2012), this methodology was shown to outperform existing state-of-the-art GPU implementations of the symmetric matrix-vector multiplication (SYMV), a kernel characterized by an irregular data access pattern. We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it particularly valuable for the Hessenberg and bidiagonal reductions of general matrices (e.g., in radar applications), which are the first steps toward computing eigenvalues and singular values, respectively. For small and medium matrix sizes (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% improvement in double precision (DP) over existing open-source and commercial software solutions. These results improve the reduction algorithms for both small and large matrices. The improved GEMV performance yields an average 30% (SP) and 15% (DP) improvement in the Hessenberg reduction, and up to 25% (SP) and 14% (DP) improvement in the bidiagonal reduction, over the implementation provided by CUBLAS 5.0.
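For context, GEMV is a canonical memory-bound kernel: computing y = alpha*A*x + beta*y performs about 2mn flops while reading the mn entries of A exactly once, i.e., roughly one flop per loaded element, so its performance is limited by memory bandwidth rather than arithmetic throughput. The sketch below is a minimal illustration of the baseline such optimizations start from — a non-transposed single-precision GEMV with one thread per row of a column-major matrix, so that consecutive threads touch consecutive addresses and every load of a column of A is coalesced. This is not the authors' optimized kernel; all names, sizes, and launch parameters here are illustrative.

// Minimal coalesced SGEMV sketch (assumption: column-major A, one thread per row).
// Illustrates the coalesced-access baseline only; it omits the architecture-specific
// blocking and tuning applied in the paper.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void sgemv_n(int m, int n, float alpha,
                        const float *A, int lda,
                        const float *x, float beta, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float acc = 0.0f;
    for (int col = 0; col < n; ++col)
        // Threads row, row+1, ... read adjacent addresses of each column: coalesced load.
        acc += A[row + (size_t)col * lda] * x[col];

    y[row] = alpha * acc + beta * y[row];
}

int main()
{
    const int m = 1024, n = 1024;
    float *A, *x, *y;
    cudaMallocManaged(&A, (size_t)m * n * sizeof(float));
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, m * sizeof(float));
    for (size_t i = 0; i < (size_t)m * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    for (int i = 0; i < m; ++i) y[i] = 0.0f;

    const int threads = 256;
    sgemv_n<<<(m + threads - 1) / threads, threads>>>(m, n, 1.0f, A, m, x, 0.0f, y);
    cudaDeviceSynchronize();
    printf("y[0] = %.1f (expected %d.0)\n", y[0], n);

    cudaFree(A); cudaFree(x); cudaFree(y);
    return 0;
}

A common next step (not shown) is to assign multiple threads per row with a shared-memory reduction to increase memory-level parallelism; the sketch above only fixes the coalesced access pattern that any such optimization must preserve.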

References

[1] CULA Dense Free Edition, http://www.culatools.com/
[2] Matrix Algebra on GPU and Multicore Architectures (MAGMA). Innovative Computing Laboratory, University of Tennessee, http://icl.cs.utk.edu/magma/
[3] NVIDIA CUDA Toolkit, http://developer.nvidia.com/cuda-toolkit
[4] NVIDIA Visual Profiler, http://developer.nvidia.com/nvidia-visual-profiler
[5] Performance Application Programming Interface (PAPI). Innovative Computing Laboratory, University of Tennessee, http://icl.cs.utk.edu/papi/
[6] The NVIDIA CUDA Basic Linear Algebra Subroutines (CUBLAS), http://developer.nvidia.com/cublas
[7] Abdelfattah, A., Dongarra, J., Keyes, D., Ltaief, H.: Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators. In: The 10th International Meeting on High Performance Computing for Computational Science, VECPAR 2012 (accepted, 2012)
[8] Humphrey, J. R., Price, D. K., Spagnoli, K. E., Paolini, A. L., Kelmelis, E. J.: CULA: Hybrid GPU Accelerated Linear Algebra Routines. In: Proceedings of the SPIE Defense and Security Symposium, DSS (April 2010)
[9] Kurzak, J., Tomov, S., Dongarra, J.: Autotuning GEMM Kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems PP(99), 1 (2012)
[10] Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J.: Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture - GeForce GTX 680. LAPACK Working Note 267
[11] Kwon, Y., Narayanan, R. M., Rangaswamy, M.: A Multi-Target Detector Using Mutual Information for Noise Radar Systems in Low SNR Regimes. In: 2010 International Waveform Diversity and Design Conference, WDD, pp. 000105-000109 (August 2010)
[12] Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 6:1-6:10. ACM, New York (2011)
[13] Nath, R., Tomov, S., Dongarra, J.: An Improved MAGMA GEMM for Fermi Graphics Processing Units. Int. J. High Perform. Comput. Appl. 24(4), 511-515 (2010)
[14] Volkov, V., Demmel, J. W.: Benchmarking GPUs to Tune Dense Linear Algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 31:1-31:11. IEEE Press, Piscataway (2008)
[15] Yu, W. C., Quan, W. D.: On the Signal Processing in the Life-Detection Radar Using an FMCW Waveform. In: 2010 Third International Symposium on Information Processing, ISIP, pp. 213-216 (October 2010)


Published In

Euro-Par'12: Proceedings of the 18th International Conference on Parallel Processing Workshops
August 2012
586 pages
ISBN:9783642369483
  • Editors: Ioannis Caragiannis, Michael Alexander, Rosa Maria Badia, Mario Cannataro, Alexandru Costan

Sponsors

  • Computer Technology Institute

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 27 August 2012

Author Tags

  1. GPU optimizations
  2. bidiagonal reduction
  3. Hessenberg reduction
  4. matrix-vector multiplication
  5. memory-bound operations

Qualifiers

  • Article

