QR decomposition on GPUs

Authors: Andrew Kerr, Dan Campbell, Mark Richards

GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units

Pages 71 - 78

https://doi.org/10.1145/1513895.1513904

Published: 08 March 2009

Abstract

QR decomposition is a computationally intensive linear algebra operation that factors a matrix A into the product of a unitary matrix Q and an upper triangular matrix R. Adaptive systems commonly employ QR decomposition to solve overdetermined least squares problems. Performance of QR decomposition is typically the crucial factor limiting problem sizes.
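As a concrete illustration of the factorization A = QR and its use for overdetermined least squares, the following is a minimal CPU sketch in plain Python using Householder reflections. It is not the GPU implementation described in this paper. The function names `householder_qr` and `solve_least_squares` are hypothetical, and the solver assumes a real-valued matrix with full column rank.

```python
import math

def householder_qr(A):
    """Return (Q, R) with A = Q R, Q m-by-m orthogonal, R m-by-n upper triangular."""
    m, n = len(A), len(A[0])
    R = [[float(v) for v in row] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Column k from the diagonal down is the vector to be reflected.
        x = [R[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Householder vector v = x + sign(x0) * ||x|| * e1 (sign avoids cancellation).
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        vtv = sum(t * t for t in v)
        # R <- H R with H = I - 2 v v^T / (v^T v), acting on rows k..m-1.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                R[k + i][j] -= s * v[i]
        # Q <- Q H, acting on columns k..m-1, so that Q = H_1 H_2 ... H_k.
        for r in range(m):
            s = 2.0 * sum(Q[r][k + i] * v[i] for i in range(len(v))) / vtv
            for i in range(len(v)):
                Q[r][k + i] -= s * v[i]
    return Q, R

def solve_least_squares(A, b):
    """Minimize ||A x - b||_2 via QR; assumes A has full column rank."""
    m, n = len(A), len(A[0])
    Q, R = householder_qr(A)
    # First n entries of Q^T b form the right-hand side of the triangular system.
    qtb = [sum(Q[i][j] * b[i] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution on the top n-by-n block of R
        s = qtb[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / R[i][i]
    return x

# Fit y = c0 + c1 * t to four points that lie exactly on y = 1 + 2 t.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
coeffs = solve_least_squares(A, b)
print(coeffs)
```

The example system has four equations and two unknowns, so it is overdetermined. Because the data lie exactly on a line, the recovered coefficients should match the intercept and slope up to rounding error.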

Graphics Processing Units (GPUs) are high-performance processors capable of executing hundreds of floating point operations in parallel. As commodity accelerators for 3D graphics, GPUs offer tremendous computational performance at relatively low cost. While GPUs are well suited to applications with much inherent parallelism requiring coarse-grain synchronization between processors, methods for efficiently utilizing GPUs for algorithms computing QR decomposition remain elusive.

In this paper, we discuss the architectural characteristics of GPUs and explain how a high-performance implementation of QR decomposition can be built. We provide detailed performance analysis of the resulting implementation for real-valued matrices and offer recommendations for achieving high performance to future developers of dense linear algebra procedures for GPUs. Our implementation sustains 143 GFLOP/s, and we believe this is the fastest announced QR implementation executing entirely on the GPU.
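To make the sustained-throughput figure easier to interpret, the sketch below relates a GFLOP/s rate to run time using the standard operation count for Householder QR of a real m-by-n matrix, 2mn² − (2/3)n³ flops, when Q is not formed explicitly. The abstract does not state which flop-counting convention the authors used, so that count is an assumption here. The function names and the 4096-by-4096 matrix size are hypothetical and chosen only for illustration.

```python
def householder_qr_flops(m, n):
    """Approximate flop count for Householder QR of a real m-by-n matrix (m >= n),
    without forming Q explicitly: 2*m*n^2 - (2/3)*n^3."""
    return 2.0 * m * n * n - (2.0 / 3.0) * n ** 3

def seconds_at_rate(flops, gflops_per_s):
    """Time needed to execute `flops` operations at a sustained rate in GFLOP/s."""
    return flops / (gflops_per_s * 1e9)

# Hypothetical problem size, not taken from the paper.
m, n = 4096, 4096
flops = householder_qr_flops(m, n)
print(f"{flops:.3e} flops, {seconds_at_rate(flops, 143.0):.3f} s at 143 GFLOP/s")
```

The same two functions can be inverted to turn a measured wall-clock time into an achieved rate, which is how throughput figures of this kind are usually derived.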

Published In

GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units

March 2009

107 pages

ISBN:9781605585178

DOI:10.1145/1513895

Conference Chairs: David Kaeli (Northeastern University), Miriam Leeser (Northeastern University)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

GPGPU '09: Second Workshop on General-Purpose Computation on Graphics Processing Units

March 8, 2009

Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%
