DOI: 10.5555/3018058.3018063
research-article

Network topologies and inevitable contention

Published: 13 November 2016

Abstract

    Network topologies can have a significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of the network and the computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds, which only specify the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks. On the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms only in very few cases, leaving many open problems for network and algorithmic design.
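
The contrast between classical and Strassen matrix multiplication on a 3D torus can be illustrated with a rough numeric sketch. This is not the paper's derivation. It assumes the memory-independent per-processor bound has the form n^2 / P^(2/omega0) with constants dropped (omega0 = 3 for classical, log2 7 for Strassen), and it assumes that each half of a bisected machine behaves like one processor of a two-processor machine, so that the words crossing the cut are shared among the bisection links. All function names are hypothetical.

```python
import math

# Exponents omega0 in the assumed per-processor bound n^2 / P^(2/omega0).
OMEGA_CLASSICAL = 3.0            # classical O(n^3) matrix multiplication
OMEGA_STRASSEN = math.log2(7)    # Strassen's algorithm, about 2.807


def per_processor_bound(n, p, omega0):
    """Assumed memory-independent per-processor bound, constants dropped."""
    return n ** 2 / p ** (2.0 / omega0)


def torus_bisection_links(k):
    """Count links cut when a k x k x k torus (k even, k >= 4) is split
    into the halves x < k/2 and x >= k/2."""
    half = k // 2
    cut = 0
    for x in range(k):
        for y in range(k):
            for z in range(k):
                # Visiting only the +1 neighbour in each dimension
                # (with wraparound) counts every link exactly once.
                neighbours = (((x + 1) % k, y, z),
                              (x, (y + 1) % k, z),
                              (x, y, (z + 1) % k))
                for nx, _, _ in neighbours:
                    if (x < half) != (nx < half):
                        cut += 1
    return cut


def bisection_contention_estimate(n, k, omega0):
    """Words that some bisection link must carry, under the assumption
    that each half of the machine acts like one processor out of two."""
    words_across_cut = per_processor_bound(n, 2, omega0)
    return words_across_cut / torus_bisection_links(k)


def contention_to_per_processor_ratio(n, k, omega0):
    """Ratio of the contention estimate to the per-processor bound
    on P = k^3 processors."""
    return (bisection_contention_estimate(n, k, omega0)
            / per_processor_bound(n, k ** 3, omega0))


if __name__ == "__main__":
    n = 4096  # hypothetical matrix dimension
    for k in (4, 8, 16):
        classical = contention_to_per_processor_ratio(n, k, OMEGA_CLASSICAL)
        strassen = contention_to_per_processor_ratio(n, k, OMEGA_STRASSEN)
        print(f"P={k ** 3:5d}  classical={classical:.3f}  strassen={strassen:.3f}")
```

Because constants are dropped, only the trend is meaningful: the classical ratio stays fixed as the torus grows, while the Strassen ratio increases with k, which is consistent with the abstract's claim that contention eventually dominates the per-processor cost for fast matrix multiplication on a 3D torus.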

    Cited By

    • (2018) A Lower Bound Technique for Communication in BSP. ACM Transactions on Parallel Computing 4:3 (1-27). DOI: 10.1145/3181776. Online publication date: 20-Feb-2018

      Published In

      COM-HPC '16: Proceedings of the First Workshop on Optimization of Communication in HPC
      November 2016
      89 pages
      ISBN: 9781509038299

      Publisher

      IEEE Press

      Author Tags

      1. FFT
      2. communication costs
      3. communication-avoiding algorithms
      4. matrix multiplication
      5. network topology
      6. numerical linear algebra
      7. strong scaling

      Conference

      SC16

      Acceptance Rates

      Overall Acceptance Rate 7 of 13 submissions, 54%
