Abstract
Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon's algorithm and to be faster in practice. "3D" algorithms arrange the \(p\) processors in a 3D array and store redundant copies of the matrices on each of \(p^{1/3}\) layers. "2D" algorithms, such as Cannon's algorithm, store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D" algorithms. For matrix multiplication, we can take advantage of any amount of extra memory to store \(c\) copies of the data, for any \(c \in \{1, 2, \ldots, \lfloor p^{1/3} \rfloor\}\), to reduce the bandwidth cost of Cannon's algorithm by a factor of \(c^{1/2}\) and the latency cost by a factor of \(c^{3/2}\). We also show that these costs match the communication lower bounds, modulo \(\mathrm{polylog}(p)\) factors. We introduce a novel algorithm for 2.5D LU decomposition. To the best of our knowledge, this LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. Our 2.5D LU algorithm uses communication-avoiding pivoting, a stable alternative to partial pivoting. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while \(c\) copies of the data can also reduce the bandwidth by a factor of \(c^{1/2}\), the latency must increase by a factor of \(c^{1/2}\), so that the 2D LU algorithm (\(c = 1\)) in fact minimizes latency. We provide implementations and performance results for 2D and 2.5D versions of all the new algorithms. Our results demonstrate that the 2.5D matrix multiplication and LU algorithms strongly scale more efficiently than their 2D counterparts. Each of our 2.5D algorithms performs over 2X faster than the corresponding 2D algorithm for certain problem sizes on 65,536 cores of a BG/P supercomputer.
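The communication savings stated in the abstract can be illustrated with a minimal cost-model sketch. The Python below encodes only the asymptotic per-processor bandwidth and latency costs quoted above (constants and lower-order terms omitted; the function names are ours, not from the paper), and checks that replicating the data \(c\) times reduces the bandwidth cost by \(c^{1/2}\) and the latency cost by \(c^{3/2}\) relative to a 2D (Cannon-style) algorithm:

```python
import math

def cannon_costs(n, p):
    """Asymptotic per-processor communication costs of 2D (Cannon-style)
    n-by-n matrix multiplication on p processors: (words moved, messages sent).
    Constant factors and lower-order terms are omitted."""
    words = n**2 / math.sqrt(p)   # bandwidth cost: O(n^2 / p^(1/2))
    messages = math.sqrt(p)       # latency cost:   O(p^(1/2))
    return words, messages

def mm_2_5d_costs(n, p, c):
    """Asymptotic costs of 2.5D matrix multiplication with c replicated
    copies of the data, 1 <= c <= p^(1/3)."""
    words = n**2 / math.sqrt(c * p)   # bandwidth reduced by c^(1/2)
    messages = math.sqrt(p / c**3)    # latency reduced by c^(3/2)
    return words, messages

n, p, c = 4096, 4096, 4
w2d, s2d = cannon_costs(n, p)
w25, s25 = mm_2_5d_costs(n, p, c)
print(w2d / w25)  # bandwidth ratio: sqrt(c) = 2.0
print(s2d / s25)  # latency ratio:   c^(3/2) = 8.0
```

Note that at \(c = p^{1/3}\) these formulas recover the costs of the classical 3D algorithm, and at \(c = 1\) those of the 2D algorithm, which is the sense in which 2.5D interpolates between the two.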
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Solomonik, E., Demmel, J. (2011). Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5