research-article

Network topologies and inevitable contention

Authors:

Andrew Gearhart,

Benjamin Lipshitz,

Yishai Oltchik,

Oded Schwartz, and

Sivan Toledo

COM-HPC '16: Proceedings of the First Workshop on Optimization of Communication in HPC

November 2016

Pages 39 - 52

Published: 13 November 2016

Abstract

Network topologies can have a significant effect on the execution costs of parallel algorithms because of inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of the network and computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds, which specify only the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; the algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks. On the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms in only a few cases, leaving many open problems for network and algorithmic design.
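
The Strassen-versus-classical contrast on a 3D torus can be illustrated with a rough exponent comparison. The sketch below is not the paper's theorem; it is a simplified bisection-style heuristic under stated assumptions: (i) the known memory-independent per-processor bounds scale as n^2 / P^(2/omega), with omega = 3 for classical multiplication and omega = log2(7) for Strassen's algorithm; (ii) on the order of n^2 words must cross a balanced cut of the network; (iii) a d-dimensional torus on P processors has on the order of P^((d-1)/d) links across its bisection. Constants and the ranges of P in which these bounds apply are ignored, and all function names are hypothetical.

```python
import math

TOL = 1e-12  # guard against floating-point noise when exponents tie


def per_processor_exponent(omega: float) -> float:
    """Exponent a in the assumed per-processor bound n^2 / P^a, with a = 2/omega."""
    return 2.0 / omega


def bisection_link_exponent(d: int) -> float:
    """Exponent b such that a d-dimensional torus has about P^b bisection links."""
    return (d - 1) / d


def contention_dominates(d: int, omega: float) -> bool:
    """True if the heuristic per-link bound n^2 / P^b exceeds n^2 / P^a.

    A smaller exponent on P means a larger bound, so contention
    dominates when b < a (assumption-based heuristic, constants ignored).
    """
    return bisection_link_exponent(d) < per_processor_exponent(omega) - TOL


if __name__ == "__main__":
    for name, omega in [("classical", 3.0), ("Strassen", math.log2(7))]:
        dominated = contention_dominates(3, omega)
        verdict = "contention-bound" if dominated else "not contention-bound"
        print(f"{name} on a 3D torus: {verdict} (heuristic)")
```

Under these assumptions the two exponents tie at 2/3 for classical multiplication, whereas for Strassen's algorithm 2/log2(7) ≈ 0.712 exceeds 2/3, so the per-link bound is asymptotically larger than the per-processor bound. This is consistent with the abstract's claim, though the paper's actual bounds depend on network and computation graph parameters not modeled here.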

Cited By

G. Bilardi, M. Scquizzato, and F. Silvestri. A Lower Bound Technique for Communication in BSP. ACM Transactions on Parallel Computing, 4(3):1--27, 2018. Online publication date: 20-Feb-2018.
https://dl.acm.org/doi/10.1145/3181776

Index Terms

Network topologies and inevitable contention
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices


Published In

COM-HPC '16: Proceedings of the First Workshop on Optimization of Communication in HPC

89 pages

ISBN: 9781509038299

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing
IEEE-CS\DATC: IEEE Computer Society

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Press


Conference

SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 13 - 18, 2016

Salt Lake City, Utah

Acceptance Rates

Overall Acceptance Rate 7 of 13 submissions, 54%
