research-article

Graph expansion and communication costs of fast matrix multiplication: regular submission

Authors:

Oded Schwartz

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

Pages 1 - 12

https://doi.org/10.1145/1989493.1989495

Published: 04 June 2011

Abstract

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal.
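The abstract does not state the bounds themselves. As a rough illustration only, the sequential lower bound usually quoted for Strassen's algorithm has the form Ω((n/√M)^(log2 7) · M), against Ω(n³/√M) for classical matrix multiplication, where n is the matrix dimension and M is the fast-memory size in words. The sketch below compares the two growth rates with all constant factors dropped; the function and parameter names are hypothetical and are not taken from the paper.

```python
import math

# Exponent of Strassen's algorithm: 7 multiplications per 2 x 2 block step.
STRASSEN_EXPONENT = math.log2(7)


def classical_io_lower_bound(n, fast_memory_words):
    """Growth rate n^3 / sqrt(M) for classical multiplication (constants omitted)."""
    return n ** 3 / math.sqrt(fast_memory_words)


def fast_io_lower_bound(n, fast_memory_words, exponent=STRASSEN_EXPONENT):
    """Growth rate (n / sqrt(M))^exponent * M (constants omitted)."""
    return (n / math.sqrt(fast_memory_words)) ** exponent * fast_memory_words


if __name__ == "__main__":
    M = 1024  # assumed fast-memory size in words, chosen for illustration
    for n in (32, 256, 4096):
        classical = classical_io_lower_bound(n, M)
        fast = fast_io_lower_bound(n, M)
        print(f"n={n:5d}  classical~{classical:.3e}  strassen~{fast:.3e}")
```

With the exponent set to 3 the second expression reduces to the classical one, and the two coincide when n equals √M; for larger n the Strassen expression grows more slowly.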

Index Terms

Graph expansion and communication costs of fast matrix multiplication: regular submission
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Information & Contributors

Information

Published In

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

June 2011

404 pages

ISBN:9781450307437

DOI:10.1145/1989493

Co-chairs:
Friedhelm Meyer auf der Heide, University of Paderborn, Germany
Rajmohan Rajaraman, Northeastern University, USA

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

EATCS: European Association for Theoretical Computer Science

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SPAA '11

Sponsor:

SPAA '11: 23rd ACM Symposium on Parallelism in Algorithms and Architectures

June 4 - 6, 2011

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics & Citations

Article Metrics

Total Citations: 18
Total Downloads: 311
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 15 Oct 2024

Citations

Cited By

Moran Y, Schwartz O, Agrawal K, Shun J (2023). Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (379-390). Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3558481.3591083
Li X, Zhang M, Chen K, Wu Y, Qian X, Zheng W (2021). 3-D Partitioning for Large-Scale Graph Processing. IEEE Transactions on Computers 70:1 (111-127). Online publication date: 1-Jan-2021.
https://doi.org/10.1109/TC.2020.2986736
Nie Q, Malik S (2020). MemFlow: Memory-Driven Data Scheduling With Datapath Co-Design in Accelerators for Large-Scale Inference Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39:9 (1875-1888). Online publication date: Sep-2020.
https://doi.org/10.1109/TCAD.2019.2925377
Purkayastha A, Hammond S, Nagappan R, Alt M (2018). Holistic Approaches to HPC Power and Workflow Management. 2018 Ninth International Green and Sustainable Computing Conference (IGSC) (1-8). Online publication date: Oct-2018.
https://doi.org/10.1109/IGCC.2018.8752150
Grigori L (2017). Introduction to Communication Avoiding Algorithms for Direct Methods of Factorization in Linear Algebra. Computational Mathematics, Numerical Analysis and Applications (153-185). Online publication date: 5-Aug-2017.
https://doi.org/10.1007/978-3-319-49631-3_4
Zhang M, Wu Y, Chen K, Qian X, Li X, Zheng W, Keeton K, Roscoe T (2016). Exploring the hidden dimension in graph processing. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (285-300). Online publication date: 2-Nov-2016.
https://dl.acm.org/doi/10.5555/3026877.3026900
Murthy K, Mellor-Crummey J (2015). Communication Avoiding Algorithms. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (150-162). Online publication date: 18-Oct-2015.
https://dl.acm.org/doi/10.1109/PACT.2015.41
Ballard G, Demmel J, Holtz O, Schwartz O (2014). Communication costs of Strassen's matrix multiplication. Communications of the ACM 57:2 (107-114). Online publication date: 1-Feb-2014.
https://dl.acm.org/doi/10.1145/2556647.2556660
Fauzia N, Elango V, Ravishankar M, Ramanujam J, Rastello F, Rountev A, Pouchet L, Sadayappan P (2013). Beyond reuse distance analysis. ACM Transactions on Architecture and Code Optimization 10:4 (1-29). Online publication date: 1-Dec-2013.
https://dl.acm.org/doi/10.1145/2541228.2555309
Ballard G, Demmel J, Holtz O, Schwartz O (2013). Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM 59:6 (1-23). Online publication date: 9-Jan-2013.
https://dl.acm.org/doi/10.1145/2395116.2395121
