research-article

Avoiding Communication in Successive Band Reduction

Authors:

James Demmel, and

Nicholas Knight

ACM Transactions on Parallel Computing (TOPC), Volume 1, Issue 2

Article No.: 11, Pages 1 - 37

https://doi.org/10.1145/2686877

Published: 18 February 2015

Abstract

The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present sequential and distributed-memory parallel algorithms for tridiagonalizing full symmetric and symmetric band matrices that asymptotically reduce communication compared to previous approaches.

The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve structure, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality and high parallel latency cost. We improve both by reorganizing the computation and obtain asymptotic improvements. We also propose new algorithms for reducing a full symmetric matrix to band form in a communication-efficient manner. In this article, we consider the cases of computing eigenvalues only and of computing eigenvalues and all eigenvectors.
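To make the phrase "annihilate-and-chase" concrete, here is a minimal sketch of a textbook-style band tridiagonalization with Givens rotations. It is not the communication-avoiding algorithm of this article: it stores the matrix densely, applies every rotation to full rows and columns, and makes no attempt at data locality. The function names (`apply_givens`, `zero_entry`, `tridiagonalize_band`) and the dense list-of-lists storage are assumptions chosen for illustration. Each rotation that annihilates an entry inside the band creates one fill-in entry ("bulge") just outside it, which is then chased down the band in steps of `b` rows until it falls off the matrix.

```python
import math


def apply_givens(A, p, q, c, s):
    """In-place similarity transform G A G^T, with G a rotation in the (p, q) plane."""
    n = len(A)
    for k in range(n):  # rotate rows p and q
        ap, aq = A[p][k], A[q][k]
        A[p][k] = c * ap + s * aq
        A[q][k] = -s * ap + c * aq
    for k in range(n):  # rotate columns p and q
        ap, aq = A[k][p], A[k][q]
        A[k][p] = c * ap + s * aq
        A[k][q] = -s * ap + c * aq


def zero_entry(A, r, col):
    """Annihilate A[r][col] (and its mirror) by rotating rows/columns r-1 and r."""
    x, y = A[r - 1][col], A[r][col]
    if y == 0.0:
        return  # nothing to annihilate
    h = math.hypot(x, y)
    apply_givens(A, r - 1, r, x / h, y / h)
    A[r][col] = A[col][r] = 0.0  # clear rounding residue


def tridiagonalize_band(A, b):
    """Return a tridiagonal matrix orthogonally similar to the symmetric
    matrix A, assuming A[i][j] == 0 whenever |i - j| > b."""
    n = len(A)
    A = [row[:] for row in A]  # work on a copy
    for j in range(n - 2):
        # Annihilate column j from the edge of the band up to the second subdiagonal.
        for i in range(min(j + b, n - 1), j + 1, -1):
            zero_entry(A, i, j)
            # The rotation fills in entry (i + b, i - 1); chase it off the matrix.
            r, col = i + b, i - 1
            while r < n:
                zero_entry(A, r, col)
                r, col = r + b, r - 1
    return A
```

A quick way to exercise the sketch is to build a small symmetric band matrix and check that the result has no entries beyond the first off-diagonal while the trace is unchanged:

```python
n, b = 6, 2
A = [[1.0 / (1 + i + j) if abs(i - j) <= b else 0.0 for j in range(n)] for i in range(n)]
T = tridiagonalize_band(A, b)
```

Because every chase step touches rows and columns far apart in the band, a loop of this shape shows the poor data locality that the abstract attributes to earlier annihilate-and-chase procedures.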

Index Terms

1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Published In

ACM Transactions on Parallel Computing Volume 1, Issue 2

Special Issue on PPOPP 2012

January 2015

224 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/2737841

Editor:
Phillip B. Gibbons
Intel Labs, Pittsburgh, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 February 2015

Accepted: 01 July 2014

Revised: 01 July 2014

Received: 01 April 2013

Published in TOPC Volume 1, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Center for Future Architecture Research
Lockheed Martin Corporation
Sandia National Laboratories
US DOE
U.S. Department of Energy Contract
Microsoft
ParLab
DARPA
Math Works
NSF
Intel
STARnet
National Instruments
Sandia National Laboratories Truman Fellowship in National Security Science and Engineering
Samsung
UC Discovery
Nokia
NVIDIA
Sandia Corporation
Oracle
Semiconductor Research Corporation
MARCO
