Communication-Avoiding Parallel Strassen: Implementation and Performance

Authors: Benjamin Lipshitz, Oded Schwartz

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 101, Pages 1 - 11

Published: 10 November 2012

Abstract

Matrix multiplication is a fundamental kernel of many high performance and scientific computing applications. Most parallel implementations use classical O(n³) matrix multiplication, even though there exist algorithms with lower arithmetic complexity. We recently presented a new Communication-Avoiding Parallel Strassen algorithm (CAPS), based on Strassen's fast matrix multiplication, that minimizes communication (SPAA '12). It communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains theoretical lower bounds.
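To make the "lower arithmetic complexity" concrete, the following is a minimal sequential sketch of Strassen's recursion, which replaces the eight block products of the classical scheme with seven. It is an illustration only, not the CAPS algorithm or the authors' implementation: it runs on one processor, uses plain Python lists, and the helper names (`add`, `sub`, `split`, `classical`, `strassen`) and the `cutoff` parameter are assumptions made for this example.

```python
def add(X, Y):
    """Elementwise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Elementwise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def classical(X, Y):
    """Classical O(n^3) product of two n-by-n matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(X):
    """Split a matrix of even dimension into four quadrants."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def strassen(A, B, cutoff=2):
    """Strassen's recursion; falls back to the classical product for
    small or odd-sized blocks (the cutoff is a hypothetical tuning knob)."""
    n = len(A)
    if n <= cutoff or n % 2:
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    # Recombine the seven products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Because each level performs seven half-size multiplications, the operation count grows as O(n^(log2 7)), roughly O(n^2.81), rather than O(n³). A practical code would call a tuned local multiply below the cutoff rather than the triple loop used here.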

In this paper we show that CAPS is also faster in practice. We benchmark and compare its performance to previous algorithms on Hopper (Cray XE6), Intrepid (IBM BG/P), and Franklin (Cray XT4). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of CAPS and predict its performance on future exascale platforms.
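The claim that a Strassen-based algorithm can communicate asymptotically less than any classical one can be illustrated with the per-processor bandwidth lower bounds from the cited SPAA '12 papers. The sketch below is a rough aid, not the performance model used in the paper: constants are dropped, latency and arithmetic terms are ignored, and the function name `bandwidth_lower_bound` and its parameters (`n` for matrix dimension, `P` for processor count, `M` for words of local memory) are assumptions made for this example.

```python
import math

# Exponent of Strassen's algorithm: log2(7), about 2.81.
OMEGA0 = math.log2(7)


def bandwidth_lower_bound(n, P, M, omega):
    """Asymptotic words moved per processor, with constants dropped.

    omega = 3 gives the classical bound; omega = log2(7) gives the
    Strassen-based bound. The result is the larger of the
    memory-dependent and memory-independent terms.
    """
    memory_dependent = n ** omega / (P * M ** (omega / 2 - 1))
    memory_independent = n ** 2 / P ** (2 / omega)
    return max(memory_dependent, memory_independent)


def strassen_to_classical_ratio(n, P, M):
    """Ratio of the Strassen-based bound to the classical bound."""
    return (bandwidth_lower_bound(n, P, M, OMEGA0)
            / bandwidth_lower_bound(n, P, M, 3.0))
```

For example, `strassen_to_classical_ratio(2**14, 4096, 3 * 2**28 // 4096)` is below 1, reflecting the smaller exponent. Since only the asymptotic form is modeled, the ratio indicates a trend and should not be read as a predicted speedup.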

References

[1] M. D. Adams and D. S. Wise. Seven at one stroke: Results from a cache-oblivious paradigm for scalable matrix algorithms. In MSPC '06: Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pages 41--50, New York, NY, USA, 2006. ACM.

[2] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 39(5):575--582, 1995.

[3] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '12, pages 77--79, New York, NY, USA, 2012. ACM.

[4] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '12, pages 193--204, New York, NY, USA, 2012. ACM.

[5] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs of fast matrix multiplication. In SPAA '11: Proceedings of the 23rd Annual Symposium on Parallelism in Algorithms and Architectures, pages 1--12, New York, NY, USA, 2011. ACM.

[6] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Matrix Analysis Applications, 32(3):866--901, 2011.

[7] J. Berntsen. Communication efficient matrix multiplication on hypercubes. Parallel Computing, 12(3):335--342, 1989.

[8] L. Cannon. A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, Bozeman, MT, 1969.

[9] J. Demmel, I. Dumitriu, and O. Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59--91, 2007.

[10] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.

[11] B. Grayson, A. Shah, and R. van de Geijn. A high performance parallel Strassen implementation. In Parallel Processing Letters, volume 6, pages 3--12, 1995.

[12] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA, 2nd edition, 2002.

[13] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput., 64(9):1017--1026, 2004.

[14] Q. Luo and J. Drake. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers. In Proceedings of the 1995 ACM Symposium on Applied Computing, SAC '95, pages 221--226, New York, NY, USA, 1995. ACM.

[15] W. F. McColl and A. Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica, 24:287--297, 1999. doi:10.1007/PL00008264.

[16] H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon. Top500 supercomputer sites, 2011. www.top500.org.

[17] J. Shalf, S. S. Dosanjh, and J. Morrison. Exascale computing technology challenges. In J. M. L. M. Palma, M. J. Daydé, O. Marques, and J. C. Lopes, editors, High Performance Computing for Computational Science - VECPAR 2010 - 9th International Conference, Berkeley, CA, USA, June 22-25, 2010, Revised Selected Papers, volume 6449 of Lecture Notes in Computer Science, pages 1--25. Springer, 2010.

[18] E. Solomonik, A. Bhatele, and J. Demmel. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 77:1--77:11, New York, NY, USA, 2011. ACM.

[19] E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. Technical Report UCB/EECS-2011-10, EECS Department, University of California, Berkeley, Feb 2011.

[20] R. A. van de Geijn and J. Watts. SUMMA: scalable universal matrix multiplication algorithm. Concurrency - Practice and Experience, 9(4):255--274, 1997.

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2012, 1161 pages

ISBN: 9781467308045

General Chair: Jeffrey K. Hollingsworth (University of Maryland)

Publisher: IEEE Computer Society Press, Washington, DC, United States

Qualifiers: Research-article

Conference

SC '12: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2012, Salt Lake City, Utah

Sponsors: SIGHPC, SIGARCH, IEEE-CS

Acceptance Rates

SC '12 Paper Acceptance Rate: 100 of 461 submissions, 22%

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


Cited By

Karppa M, Kaski P, Chan T (2019). Probabilistic tensors and opportunistic boolean matrix multiplication. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 496-515. Online publication date: 6-Jan-2019.
https://dl.acm.org/doi/10.5555/3310435.3310466

Beniamini G, Schwartz O, Scheideler C, Berenbrink P (2019). Faster Matrix Multiplication via Sparse Decomposition. The 31st ACM Symposium on Parallelism in Algorithms and Architectures, pages 11-22. Online publication date: 17-Jun-2019.
https://dl.acm.org/doi/10.1145/3323165.3323188

Ramanan P, Afrati F, Sroka J, Yi K, Hidders J (2018). Six Pass MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication. Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, pages 1-6. Online publication date: 15-Jun-2018.
https://dl.acm.org/doi/10.1145/3206333.3206336

Deng M, Ramanan P, Afrati F, Sroka J, Koutris P (2017). MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication. Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, pages 1-10. Online publication date: 14-May-2017.
https://dl.acm.org/doi/10.1145/3070607.3070614

Huang J, Smith T, Henry G, van de Geijn R, West J (2016). Strassen's algorithm reloaded. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. Online publication date: 13-Nov-2016.
https://dl.acm.org/doi/10.5555/3014904.3014983

Benson A, Ballard G (2015). A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Notices, 50:8, pages 42-53. Online publication date: 24-Jan-2015.
https://dl.acm.org/doi/10.1145/2858788.2688513

Benson A, Ballard G, Cohen A, Grove D (2015). A framework for practical parallel fast matrix multiplication. Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 42-53. Online publication date: 24-Jan-2015.
https://dl.acm.org/doi/10.1145/2688500.2688513

Williams R, Shmoys D (2014). Faster all-pairs shortest paths via circuit complexity. Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 664-673. Online publication date: 31-May-2014.
https://dl.acm.org/doi/10.1145/2591796.2591811

Ballard G, Demmel J, Holtz O, Schwartz O (2014). Communication costs of Strassen's matrix multiplication. Communications of the ACM, 57:2, pages 107-114. Online publication date: 1-Feb-2014.
https://dl.acm.org/doi/10.1145/2556647.2556660

Rajbhandari S, Nikam A, Lai P, Stock K, Krishnamoorthy S, Sadayappan P, Damkroger T, Dongarra J (2014). A communication-optimal framework for contracting distributed tensors. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 375-386. Online publication date: 16-Nov-2014.
https://dl.acm.org/doi/10.1109/SC.2014.36
