Communication-optimal Parallel and Sequential QR and LU Factorizations

Published: 01 February 2012

    Abstract

    We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds for the number of multiplications done by “non-Strassen-like” QR, and using these in known communication lower bounds that are proportional to the number of multiplications. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We derive analogous communication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least some of these lower bounds. The sequential and parallel QR algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example, up to 6.7 times over ScaLAPACK. A performance model for the parallel algorithm for general rectangular matrices predicts significant speedups over ScaLAPACK.
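
    For orientation, the "known communication lower bounds that are proportional to the number of multiplications" can be written roughly as below. This is a sketch in notation introduced here, not taken from the page: G is the number of multiplications, M the size of fast (or per-processor) memory in words, W the number of words moved, S the number of messages, P the number of processors, and m × n (with m ≥ n) the matrix dimensions. Constants and lower-order terms are omitted, and the count G = Ω(mn²) is assumed to hold only for classical ("non-Strassen-like") algorithms.

```latex
% Sequential case (two-level memory): bandwidth (W) and latency (S) bounds
W = \Omega\!\left(\frac{G}{\sqrt{M}}\right),
\qquad
S = \Omega\!\left(\frac{G}{M^{3/2}}\right)

% Substituting G = \Omega(mn^2) for QR of an m-by-n matrix, m >= n
W = \Omega\!\left(\frac{mn^2}{\sqrt{M}}\right),
\qquad
S = \Omega\!\left(\frac{mn^2}{M^{3/2}}\right)

% Parallel case: at least one of the P processors performs G/P multiplications
W = \Omega\!\left(\frac{mn^2}{P\sqrt{M}}\right),
\qquad
S = \Omega\!\left(\frac{mn^2}{P\,M^{3/2}}\right)
```

    Read this way, "optimal up to polylogarithmic factors" means an algorithm's W and S match these expressions to within factors that are polylogarithmic in the problem parameters.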

    Information

    Published In

    SIAM Journal on Scientific Computing, Volume 34, Issue 1
    2012
    694 pages

    Publisher

    Society for Industrial and Applied Mathematics

    United States

    Publication History

    Published: 01 February 2012

    Author Tags

    1. LU factorization
    2. QR factorization
    3. linear algebra

    Qualifiers

    • Article

    Cited By

    • (2024) A LAPACK Implementation of the Dynamic Mode Decomposition. ACM Transactions on Mathematical Software 50:1 (1-32). DOI: 10.1145/3640012. Online publication date: 19-Jan-2024
    • (2024) Tightening I/O Lower Bounds through the Hourglass Dependency Pattern. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (183-193). DOI: 10.1145/3626183.3659986. Online publication date: 17-Jun-2024
    • (2024) A Distributed Block Chebyshev-Davidson Algorithm for Parallel Spectral Clustering. Journal of Scientific Computing 98:3. DOI: 10.1007/s10915-024-02455-y. Online publication date: 17-Feb-2024
    • (2024) RPH-PGD: Randomly Projected Hessian for Perturbed Gradient Descent. Advances in Knowledge Discovery and Data Mining (248-259). DOI: 10.1007/978-981-97-2253-2_20. Online publication date: 7-May-2024
    • (2023) Block subsampled randomized Hadamard transform for Nyström approximation on distributed architectures. Proceedings of the 40th International Conference on Machine Learning (1564-1576). DOI: 10.5555/3618408.3618474. Online publication date: 23-Jul-2023
    • (2023) A fast adaptive randomized PCA algorithm. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (3695-3704). DOI: 10.24963/ijcai.2023/411. Online publication date: 19-Aug-2023
    • (2023) Fast truncated SVD of sparse and dense matrices on graphics processors. International Journal of High Performance Computing Applications 37:3-4 (380-393). DOI: 10.1177/10943420231179699. Online publication date: 1-Jul-2023
    • (2023) Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations. ACM Transactions on Parallel Computing 10:3 (1-31). DOI: 10.1145/3614444. Online publication date: 22-Sep-2023
    • (2023) Task-based Parallel Programming for Scalable Matrix Product Algorithms. ACM Transactions on Mathematical Software 49:2 (1-23). DOI: 10.1145/3583560. Online publication date: 15-Jun-2023
    • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3581784.3607096. Online publication date: 12-Nov-2023
