
A framework for practical parallel fast matrix multiplication

Published: 24 January 2015

Abstract

Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Finally, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
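For readers unfamiliar with the family of algorithms the abstract refers to, the best-known "fast" algorithm is Strassen's method, which multiplies 2 × 2 block matrices with 7 recursive multiplications instead of the classical 8, giving an O(n^2.81) operation count. The sketch below is a minimal illustrative Python/NumPy version, not the paper's generated code; the `leaf` cutoff parameter and the fallback for odd dimensions are simplifying assumptions added here.

```python
import numpy as np

def strassen(A, B, leaf=128):
    """One level-recursive sketch of Strassen's <2,2,2> fast algorithm.

    Illustrative only: `leaf` is an assumed cutoff below which (or for
    odd dimensions) we fall back to the classical product.
    """
    n = A.shape[0]
    if n <= leaf or n % 2 != 0:
        return A @ B  # classical multiplication at the base case

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Strassen's seven products (instead of eight recursive multiplies).
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)

    # Recombine the products into the four output blocks.
    C = np.empty((n, B.shape[1]))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

For example, `strassen(np.random.rand(1024, 1024), np.random.rand(1024, 1024))` agrees with the classical product up to floating-point roundoff. The paper's code generation tool produces analogous, but tuned and parallelized, implementations for many such fast algorithms beyond Strassen's.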




Information

    Published In

    ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
    ISSN: 0362-1340 · EISSN: 1558-1160
    DOI: 10.1145/2858788
    Editor: Andy Gill

    PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages
    ISBN: 9781450332057
    DOI: 10.1145/2688500

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 January 2015
    Published in SIGPLAN Volume 50, Issue 8


    Author Tags

    1. dense linear algebra
    2. fast matrix multiplication
    3. parallel linear algebra
    4. shared memory

    Qualifiers

    • Research-article


    Cited By

    • Algorithms for Matrix Multiplication via Sampling and Opportunistic Matrix Multiplication. Algorithmica 86(9):2822–2844, September 2024. DOI: 10.1007/s00453-024-01247-y
    • Towards Practical Fast Matrix Multiplication based on Trilinear Aggregation. Proceedings of the 2023 International Symposium on Symbolic and Algebraic Computation, pages 289–297, July 2023. DOI: 10.1145/3597066.3597099
    • Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, pages 379–390, June 2023. DOI: 10.1145/3558481.3591083
    • Pebbling Game and Alternative Basis for High Performance Matrix Multiplication. SIAM Journal on Scientific Computing 45(6):C277–C303, November 2023. DOI: 10.1137/22M1502719
    • Amalur: Data Integration Meets Machine Learning. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 3729–3739, April 2023. DOI: 10.1109/ICDE55515.2023.00301
    • Density-optimized intersection-free mapping and matrix multiplication for join-project operations. Proceedings of the VLDB Endowment 15(10):2244–2256, June 2022. DOI: 10.14778/3547305.3547326
    • Seamless optimization of the GEMM kernel for task-based programming models. Proceedings of the 36th ACM International Conference on Supercomputing, pages 1–11, June 2022. DOI: 10.1145/3524059.3532385
    • Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610(7930):47–53, October 2022. DOI: 10.1038/s41586-022-05172-4
    • Efficient algorithm for proper orthogonal decomposition of block-structured adaptively refined numerical simulations. Journal of Computational Physics 469:111527, November 2022. DOI: 10.1016/j.jcp.2022.111527
    • Optimal sequence for chain matrix multiplication using evolutionary algorithm. PeerJ Computer Science 7:e395, February 2021. DOI: 10.7717/peerj-cs.395
