
Communication-optimal Parallel and Sequential QR and LU Factorizations

Authors:

Julien Langou

SIAM Journal on Scientific Computing, Volume 34, Issue 1

Pages 206 - 239

https://doi.org/10.1137/080731992

Published: 01 February 2012

Abstract

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds for the number of multiplications done by “non-Strassen-like” QR, and using these in known communication lower bounds that are proportional to the number of multiplications. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We derive analogous communication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least some of these lower bounds. The sequential and parallel QR algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example, up to 6.7 times over ScaLAPACK. A performance model for the parallel algorithm for general rectangular matrices predicts significant speedups over ScaLAPACK.
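As an illustration of why tall and skinny matrices admit QR factorizations with little communication, the following is a minimal sketch of a one-level reduction: each row block is factored on its own, and only the small R factors are combined. This is an assumed, simplified scheme written with NumPy, not the algorithm as specified in the paper; the function name `tsqr_r`, the block count, and the test matrix are hypothetical choices made for this example.

```python
import numpy as np


def tsqr_r(A, num_blocks=4):
    """Return an R factor of A using a one-level reduction over row blocks.

    Hypothetical helper for illustration only; it computes R but does not
    form or store the orthogonal factor Q.
    """
    # Split the tall matrix into row blocks; in a parallel setting each
    # block would live on a different processor.
    blocks = np.array_split(A, num_blocks, axis=0)
    # Factor every block independently; no data is exchanged in this step.
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]
    # Only the small n-by-n R factors are gathered and factored once more.
    stacked = np.vstack(local_rs)
    return np.linalg.qr(stacked, mode="r")


rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))  # tall and skinny: 1000 rows, 8 columns

R_tree = tsqr_r(A)
R_ref = np.linalg.qr(A, mode="r")

# R is unique only up to the signs of its rows, so compare R^T R with A^T A.
gram_err = np.linalg.norm(R_tree.T @ R_tree - A.T @ A) / np.linalg.norm(A.T @ A)
```

In this sketch the data exchanged between blocks is `num_blocks` triangular factors of size 8 by 8, independent of the 1000 rows of `A`, which is the intuition behind the reduced communication for tall and skinny inputs.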




Published In

SIAM Journal on Scientific Computing, Volume 34, Issue 1, 2012 (694 pages)

ISSN: 1064-8275

Publisher: Society for Industrial and Applied Mathematics, United States
