
A framework for practical parallel fast matrix multiplication

Published: 24 January 2015

Abstract

Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Finally, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
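For readers unfamiliar with the family of algorithms the abstract refers to, the best-known "fast" algorithm is Strassen's method, which multiplies 2 × 2 block matrices with 7 recursive multiplications instead of the classical 8, giving an O(n^2.81) operation count. The sketch below is a minimal illustrative Python/NumPy version, not the paper's generated code; the `leaf` cutoff parameter and the fallback for odd dimensions are simplifying assumptions added here.

```python
import numpy as np

def strassen(A, B, leaf=128):
    """One level-recursive sketch of Strassen's <2,2,2> fast algorithm.

    Illustrative only: `leaf` is an assumed cutoff below which (or for
    odd dimensions) we fall back to the classical product.
    """
    n = A.shape[0]
    if n <= leaf or n % 2 != 0:
        return A @ B  # classical multiplication at the base case

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Strassen's seven products (instead of eight recursive multiplies).
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)

    # Recombine the products into the four output blocks.
    C = np.empty((n, B.shape[1]))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

For example, `strassen(np.random.rand(1024, 1024), np.random.rand(1024, 1024))` agrees with the classical product up to floating-point roundoff. The paper's code generation tool produces analogous, but tuned and parallelized, implementations for many such fast algorithms beyond Strassen's.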




Information

    Published In

    ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
    ISSN: 0362-1340 · EISSN: 1558-1160
    DOI: 10.1145/2858788
    Editor: Andy Gill

    PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages
    ISBN: 9781450332057
    DOI: 10.1145/2688500

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 January 2015
    Published in SIGPLAN Volume 50, Issue 8


    Author Tags

    1. dense linear algebra
    2. fast matrix multiplication
    3. parallel linear algebra
    4. shared memory

    Qualifiers

    • Research-article


    Cited By

    • Algorithms for Matrix Multiplication via Sampling and Opportunistic Matrix Multiplication. Algorithmica 86(9):2822–2844, September 2024. DOI: 10.1007/s00453-024-01247-y
    • Towards Practical Fast Matrix Multiplication based on Trilinear Aggregation. Proceedings of the 2023 International Symposium on Symbolic and Algebraic Computation, pages 289–297, July 2023. DOI: 10.1145/3597066.3597099
    • Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, pages 379–390, June 2023. DOI: 10.1145/3558481.3591083
    • Pebbling Game and Alternative Basis for High Performance Matrix Multiplication. SIAM Journal on Scientific Computing 45(6):C277–C303, November 2023. DOI: 10.1137/22M1502719
    • Amalur: Data Integration Meets Machine Learning. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 3729–3739, April 2023. DOI: 10.1109/ICDE55515.2023.00301
    • Density-optimized intersection-free mapping and matrix multiplication for join-project operations. Proceedings of the VLDB Endowment 15(10):2244–2256, June 2022. DOI: 10.14778/3547305.3547326
    • Seamless optimization of the GEMM kernel for task-based programming models. Proceedings of the 36th ACM International Conference on Supercomputing, pages 1–11, June 2022. DOI: 10.1145/3524059.3532385
    • Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610(7930):47–53, October 2022. DOI: 10.1038/s41586-022-05172-4
    • Efficient algorithm for proper orthogonal decomposition of block-structured adaptively refined numerical simulations. Journal of Computational Physics 469:111527, November 2022. DOI: 10.1016/j.jcp.2022.111527
    • Optimal sequence for chain matrix multiplication using evolutionary algorithm. PeerJ Computer Science 7:e395, February 2021. DOI: 10.7717/peerj-cs.395
